1. Introduction. Let $(X_k)_{k\ge 1}$ be a sequence of random variables whose
distribution $f^*$ lies in one of a nested family of models $(\mathcal{M}_q)_{q\ge 1}$,
indexed (and ordered) by the integers. We define the model order as the smallest index
$q^*$ such that the true distribution $f^*$ lies in the corresponding model class. The
model order typically determines the most parsimonious representation of the true
distribution of
the underlying model (for example, it might determine the parametrization of the 
model which has the smallest possible dimension). On the other hand, the model 
order often has a concrete interpretation in terms of the modelling of the underly- 
ing phenomenon (for example, the estimation of the number of clusters in a data 
set, or the number of regimes in an economic time series). Therefore, the problem 
of estimating the model order from observed data is of significant practical, as well 
as theoretical, interest. 

Of course, a satisfactory solution to this problem must provide an estimation 
method that does not assume prior knowledge on the underlying unknown distribution
$f^*$. In particular, prior bounds on model order and on parameter sets should
be avoided. Yet, in this light, even one of the most widely used model selection 
criteria — the Bayesian Information Criterion (BIC) of Schwarz — is poorly under- 
stood. The chief motivation for the use of BIC (as opposed to other model selec- 
tion criteria, such as Akaike's Information Criterion) is that it is expected to yield 
a strongly consistent estimator of the model order. However, as is pointed out by 
Csiszár and Shields [9], almost all existing consistency proofs assume a prior upper
bound on the order as well as compactness of the underlying parameter sets. This 
is hardly satisfactory from the theoretical point of view, and provides little confi- 
dence in the basic motivation for this method. More delicate questions, such as the 
minimal penalty that yields a strongly consistent order estimator in the absence of a 
prior bound on the order, also remain open (the problem of identifying the minimal 
penalty, which minimizes the probability of underestimating the model order, was 
also raised in [9]). 

Characterizing strong consistency of penalized likelihood order estimators hinges 
on a precise understanding of the pathwise fluctuations of the likelihood ratio
statistic

\[ \sup_{f\in\mathcal{M}_q} \ell_n(f) - \sup_{f\in\mathcal{M}_{q^*}} \ell_n(f), \]

as $n\to\infty$, uniformly in the model order $q\ge q^*$ (here $\ell_n(f)$ is the
log-likelihood of $(X_k)_{k\le n}$ under the distribution $f\in\mathcal{M}_q$). When
there is a known upper bound on the order $q^*\le q_{\max}<\infty$ and the model
classes $\mathcal{M}_q$ are parametrized by a compact subset of Euclidean space, an
upper bound on the pathwise fluctuations can
be obtained by classical parametric methods: Taylor expansion of the likelihood 
and an application of a law of iterated logarithm. This approach forms the basis 
for most consistency proofs for penalized likelihood order estimators in the
literature, for example [14, 22, 10, 16, 6]. However, such techniques fail in the
absence of a prior upper bound: even though each model class $\mathcal{M}_q$ is finite
dimensional, the full model $\mathcal{M}=\bigcup_q\mathcal{M}_q$ is infinite
dimensional and, as such, the problem in the absence of a prior upper bound is
inherently nonparametric.$^1$ When the model classes $\mathcal{M}_q$ are noncompact,
one must introduce sieves
$\mathcal{M}_q^n\subset\mathcal{M}_q^{n+1}\subset\cdots\subset\mathcal{M}_q$,
which complicate the problem further (in this case even the parametric theory
remains poorly understood, see [15, 3, 19]). An entirely different approach based on
universal coding theory [10, 12, 5, 7] yields pathwise upper bounds on the likeli- 
hood ratio statistic that do not require prior bounds on the order or compactness 
of the models. However, these bounds are far from tight and cannot even establish 
consistency of BIC, let alone smaller penalties (this appears to be a fundamental 
limitation of this approach due to Rissanen's theorem, see [24, 2]). 

To our knowledge, the only setting in which the pathwise fluctuations of the
likelihood ratio statistic have been studied in the absence of a prior bound on the
order is that of higher-order Markov chains, where Csiszár and Shields [9, 8] proved
consistency of BIC. The proofs in [9, 8] use delicate estimates specific to Markov
chains,
and do not yield minimal penalties. However, it was shown in [27] that a sharp 
bound can be obtained in the Markov chain case using techniques from empiri- 
cal process theory, the main difficulty being the dependence structure of Markov 
chains. 

The aim of this paper is to obtain generally applicable upper and lower bounds 
on the pathwise fluctuations of the likelihood ratio statistic uniformly in the model 
order $q\ge q^*$, in the case of i.i.d. observations $(X_k)_{k\ge 1}$, without a prior
bound on the model order and in possibly noncompact parameter spaces. We use empirical
process methods as in [27], but the difficulties to be surmounted in the present
setting are of a different nature. Though the Markov chain models in [9, 8, 27] suffer
from a lack of independence, geometrically these models are exceedingly simple: the
family of $q$th-order Markov chains endowed with the Hellinger distance is simply a
Euclidean ball when viewed in the appropriate parametrization. In contrast, in general
order estimation problems, one is often faced with model classes that are
geometrically very complex. An important case study considered in this paper is that
of finite mixture models (widely used in practice for clustering), which possess a
notoriously complicated non-regular geometry. To obtain sharp bounds
in such models, we will develop tools that can be used to obtain local and weighted
entropy bounds, required for our pathwise fluctuation theorems, in models with
non-regular geometric structure. These results are of independent interest: we are
not aware of any existing local entropy results for models that possess a nontrivial
geometric structure (the difficulty of obtaining local entropy bounds for mixture
models is noted, for example, in [13, 21]). Finally, we will apply our results to
establish strong consistency of BIC and to identify minimal penalties for order
estimation for general model classes, in the absence of prior bounds on the order and
the underlying parameter set.

$^1$ One of the key issues in this setting is to understand the dependence of the
fluctuations of the likelihood ratio statistic on the dimension of the model classes
$\mathcal{M}_q$. However, one of the main results of this paper shows that for regular
parametric models, the fluctuations of the likelihood ratio statistic uniformly in
$q\ge q^*$ are dimension independent when a prior upper bound is assumed (cf. Remark
2.7), which is certainly not the case in the absence of a prior upper bound.
Therefore, we find that the pathwise fluctuations of the likelihood ratio statistic
with and without a prior upper bound are qualitatively different.

The remainder of this paper is organized as follows. Section 2 introduces the 
general model under consideration, and states our results on the pathwise fluctua- 
tions of the likelihood ratio statistic. Section 3 states our general results on local and 
weighted entropies, and considers also the special case of mixture models. Section 
4 derives the consequences for order estimation. Proofs are given in section 5. 

2. Pathwise fluctuations of the likelihood ratio statistic. 

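2.1. Basic setting and notation. Let $(E,\mathcal{E},\mu)$ be a measure space. For
each $q,n\ge 1$, let $\mathcal{M}_q^n$ be a given family of strictly positive
probability densities with respect to $\mu$ (that is, we assume that
$\int f\,d\mu=1$ and that $f>0$ $\mu$-a.e. for every $f\in\mathcal{M}_q^n$).
Moreover, we assume that $(\mathcal{M}_q^n)_{q,n\ge 1}$ is a nested family of models
in the sense that $\mathcal{M}_q^n\subseteq\mathcal{M}_{q+1}^n$ and
$\mathcal{M}_q^n\subseteq\mathcal{M}_q^{n+1}$ for all $q,n\ge 1$. Let
$\mathcal{M}_q=\bigcup_n\mathcal{M}_q^n$, $\mathcal{M}^n=\bigcup_q\mathcal{M}_q^n$,
$\mathcal{M}=\bigcup_{q,n}\mathcal{M}_q^n$.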

Consider an i.i.d. sequence of $E$-valued random variables $(X_k)_{k\ge 1}$ whose
common distribution under the measure $\mathbf{P}^*$ is $f^*d\mu$, where
$f^*\in\mathcal{M}_{q^*}\setminus\operatorname{cl}\mathcal{M}_{q^*-1}$ for some
$q^*\ge 1$ (here $\operatorname{cl}\mathcal{M}_q$ denotes the $L^1(d\mu)$-closure of
$\mathcal{M}_q$). The index $q^*$ is called the model order. Let us define the
log-likelihood function

\[ \ell_n(f) = \sum_{i=1}^n \log f(X_i), \qquad f\in\mathcal{M}. \]

Evidently $\ell_n(f)$ is the log-likelihood of the i.i.d. sequence $(X_k)_{k\le n}$
when $X_k\sim f\,d\mu$. Our aim is to study the pathwise fluctuations of the
likelihood ratio statistic

\[ \sup_{f\in\mathcal{M}_q^n} \ell_n(f) - \sup_{f\in\mathcal{M}_{q^*}^n} \ell_n(f) \]

as $n\to\infty$, uniformly over the order parameter $q\ge q^*$. Pathwise upper and lower
bounds on the likelihood ratio statistic are the key ingredient in the study of strong 
consistency of penalized likelihood order estimators (see section 4). 

Example 2.1 (Location mixtures). The guiding example for our theory, the 
case of location mixtures, will be studied in detail in sections 3.2 and 4.2 below. 
We presently introduce this example in order to clarify our basic setup.
Let $E=\mathbb{R}^d$ (with its Borel $\sigma$-field $\mathcal{E}$) and let $\mu$ be
the Lebesgue measure on $\mathbb{R}^d$. We fix a strictly positive probability density
$f_0$ with respect to $\mu$, and define $f_\theta(x)=f_0(x-\theta)$ for
$x,\theta\in\mathbb{R}^d$. Fix a sequence $T(n)\uparrow\infty$ and define

\[ \mathcal{M}_q^n = \Big\{ \sum_{i=1}^q \pi_i f_{\theta_i} : \pi_i\ge 0,\ \sum_{i=1}^q \pi_i = 1,\ \|\theta_i\|\le T(n) \Big\}. \]

Then $\mathcal{M}_q$ is the family of all $q$-component mixtures of translates of the
density $f_0$, while $\mathcal{M}_q^n$ is the subset of the mixtures $\mathcal{M}_q$
whose translation parameters $(\theta_i)_{i=1,\ldots,q}$ are restricted to a ball of
radius $T(n)$. The number of components $q^*$ of the true mixture $f^*\in\mathcal{M}$
can be estimated from observations using the order estimator

\[ \hat q_n = \operatorname*{argmax}_{q\ge 1} \Big\{ \sup_{f\in\mathcal{M}_q^n} \ell_n(f) - \mathrm{pen}(n,q) \Big\}. \]

Pathwise control of the likelihood ratio statistic allows us to identify what penalties
$\mathrm{pen}(n,q)$ and cutoff sequences $T(n)$ yield strong consistency of
$\hat q_n$ (cf. section 4.2).
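
To make the order estimator of Example 2.1 concrete, the following Python sketch may be helpful. It is an illustration only, under assumptions that are not part of the text: $d=1$, $f_0$ is the standard Gaussian density, the supremum over $\mathcal{M}_q^n$ is approximated by a local maximum found by EM, the maximization over $q$ is truncated at a hypothetical cap `q_max`, and the penalty `q * log(n)` and cutoff `T = log(n)` are placeholder choices. All function names are hypothetical.

```python
import numpy as np

def log_lik_mixture(x, weights, locs):
    """Log-likelihood l_n(f) of a unit-variance Gaussian location mixture (d = 1)."""
    log_comp = np.log(weights) - 0.5 * (x[:, None] - locs) ** 2 - 0.5 * np.log(2 * np.pi)
    return np.logaddexp.reduce(log_comp, axis=1).sum()

def fit_em(x, q, T, n_iter=200):
    """EM approximation (a local maximum, not the true sup) of l_n over
    q-component mixtures whose locations are restricted to [-T, T]."""
    locs = np.clip(np.quantile(x, (np.arange(q) + 0.5) / q), -T, T)
    weights = np.full(q, 1.0 / q)
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each observation
        log_comp = np.log(weights) - 0.5 * (x[:, None] - locs) ** 2
        resp = np.exp(log_comp - np.logaddexp.reduce(log_comp, axis=1, keepdims=True))
        mass = resp.sum(axis=0)
        # M-step: weights are floored to keep logarithms finite
        weights = np.maximum(mass / len(x), 1e-12)
        weights /= weights.sum()
        # the constrained location update is the clipped weighted mean,
        # because the objective is a concave quadratic in each location
        locs = np.clip(resp.T @ x / np.maximum(mass, 1e-12), -T, T)
    return log_lik_mixture(x, weights, locs)

def order_estimate(x, T, q_max, pen):
    """argmax over q = 1..q_max (hypothetical cap) of fitted log-likelihood minus penalty."""
    n = len(x)
    scores = [fit_em(x, q, T) - pen(n, q) for q in range(1, q_max + 1)]
    return 1 + int(np.argmax(scores))

rng = np.random.default_rng(0)
n = 2000
x = rng.choice(np.array([-2.0, 2.0]), size=n) + rng.standard_normal(n)  # two-component truth
q_hat = order_estimate(x, T=np.log(n), q_max=5, pen=lambda n, q: q * np.log(n))
```

The cap `q_max` is a computational convenience only; the estimator in the text maximizes over all $q\ge 1$, and which penalties and cutoffs are admissible is the subject of section 4.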

Remark 2.2. To avoid measurability problems and other technical complications, we
employ throughout this paper the simplifying convention that all uncountable suprema
(such as $\sup_{f\in\mathcal{M}_q^n}\ell_n(f)$) are interpreted as essential suprema
with respect to the measure $\mathbf{P}^*$. In the majority of applications the model
classes $\mathcal{M}_q^n$ will be separable, in which case the supremum and essential
supremum coincide.

In the sequel, we will denote by $\|\cdot\|_p$ the $L^p(f^*d\mu)$-norm, that is,
$\|g\|_p^p=\int|g(x)|^pf^*(x)\,\mu(dx)$, and we denote by
$\langle f,g\rangle=\int f(x)g(x)f^*(x)\,\mu(dx)$ the Hilbert space inner product in
$L^2(f^*d\mu)$. Define the Hellinger distance

\[ h(f,g)^2 = \int(\sqrt f-\sqrt g)^2\,d\mu, \qquad f,g\in\mathcal{M}. \]

It is easily seen that $h(f,f^*)=\|\sqrt{f/f^*}-1\|_2$. Finally, we will denote by
$\mathcal{N}(\mathcal{Q},\delta)$ for any class of functions $\mathcal{Q}$ and
$\delta>0$ the minimal number of brackets of $L^2(f^*d\mu)$-width $\delta$ needed to
cover $\mathcal{Q}$: that is, $\mathcal{N}(\mathcal{Q},\delta)$ is the smallest
cardinality $N$ of a collection of pairs of functions $\{g_i^L,g_i^U\}_{i=1,\ldots,N}$
such that $\max_{i\le N}\|g_i^U-g_i^L\|_2\le\delta$ and for every $g\in\mathcal{Q}$ we
have $g_i^L\le g\le g_i^U$ pointwise for some $i\le N$.
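
The identity between the Hellinger distance and the weighted $L^2$-norm of $\sqrt{f/f^*}-1$ can be checked numerically. The sketch below is a minimal illustration under assumptions of our own choosing: $d=1$, $f^*$ and $f$ are unit-variance Gaussian densities with means 0 and 1, and integrals are replaced by Riemann sums on a hypothetical uniform grid. For this pair the Hellinger affinity is known in closed form, $\int\sqrt{ff^*}\,d\mu=e^{-1/8}$, which gives a third value to compare against.

```python
import numpy as np

def gaussian_pdf(x, mean, sd=1.0):
    """Density of N(mean, sd^2) evaluated on the grid x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def hellinger(f, g, dx):
    """h(f, g) = sqrt( int (sqrt f - sqrt g)^2 dmu ), as a Riemann sum."""
    return np.sqrt(np.sum((np.sqrt(f) - np.sqrt(g)) ** 2) * dx)

def weighted_l2_norm(g, f_star, dx):
    """||g||_2 in L^2(f* dmu), as a Riemann sum."""
    return np.sqrt(np.sum(g ** 2 * f_star) * dx)

x = np.linspace(-12.0, 12.0, 240001)  # hypothetical grid, wide enough for the tails
dx = x[1] - x[0]
f_star = gaussian_pdf(x, 0.0)         # plays the role of f*
f = gaussian_pdf(x, 1.0)              # a competing density f

h_direct = hellinger(f, f_star, dx)
h_weighted = weighted_l2_norm(np.sqrt(f / f_star) - 1.0, f_star, dx)
h_closed_form = np.sqrt(2.0 * (1.0 - np.exp(-1.0 / 8.0)))
```

The three quantities agree up to discretization and rounding error, since $(\sqrt{f/f^*}-1)^2f^*=(\sqrt f-\sqrt{f^*})^2$ pointwise.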

2.2. Upper bound. We aim to obtain a pathwise upper bound on the likelihood 
ratio statistic that holds uniformly in $q\ge q^*$. To this end, define for
$q,n\ge 1$ and $\varepsilon>0$ the Hellinger ball

\[ \mathcal{H}_q^n(\varepsilon) = \big\{ \sqrt{f/f^*} : f\in\mathcal{M}_q^n,\ h(f,f^*)\le\varepsilon \big\}. \]

Note that the definition of $\mathcal{H}_q^n(\varepsilon)$ depends on $f^*$ (which is
fixed throughout the paper). The following result shows that the geometry of the
Hellinger balls $\mathcal{H}_q^n(\varepsilon)$ controls the pathwise fluctuations of
the likelihood ratio statistic.

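THEOREM 2.3. Suppose that for all $n$ sufficiently large, we have

\[ \mathcal{N}(\mathcal{H}_q^n(\varepsilon),\delta) \le \Big(\frac{K(n)\,\varepsilon}{\delta}\Big)^{\eta(q)} \]

for all $q\ge q^*$ and $\delta\le\varepsilon$, with $K(n)\ge 1$ and $\eta(q)\ge q$
increasing functions. Then

\[ \limsup_{n\to\infty} \frac{1}{\log K(2n)\vee\log\log n}\, \sup_{q\ge q^*} \frac{1}{\eta(q)} \Big\{ \sup_{f\in\mathcal{M}_q^n}\ell_n(f) - \sup_{f\in\mathcal{M}_{q^*}^n}\ell_n(f) \Big\} \le C \]

$\mathbf{P}^*$-a.s., where $C>0$ is a universal constant.

The proof of Theorem 2.3 is given in section 5.1 below.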

The assumption of Theorem 2.3 on the entropy of the Hellinger balls
$\mathcal{H}_q^n(\varepsilon)$ states, roughly speaking, that the class of densities
$\mathcal{M}_q^n$ endowed with the Hellinger distance has the same metric structure as
a Euclidean ball of dimension $\eta(q)$ and radius of order $K(n)$, at least locally
in a neighborhood of the true density $f^*$. The effective dimension $\eta(q)$
controls the fluctuations of the likelihood ratio statistic as a function of the
model order, while the effective radius $K(n)$ controls the fluctuations as a
function of time up to a minimal rate of order $\log\log n$. In the following section
we will see that the minimal $\log\log n$ rate is indeed optimal.

Let us note that the geometric structure required by Theorem 2.3 is far from ob- 
vious in many cases of practical interest. For example, in the case of finite mixtures, 
the geometry of the parameter sets corresponding to Hellinger balls is notoriously 
complex and highly non-regular, but we will nonetheless verify the assumption of 
Theorem 2.3 (see section 3.2). In order to apply Theorem 2.3 in such cases, we 
therefore need to develop tools to establish local entropy bounds in models that 
possess nontrivial geometric structure. Section 3 below is devoted to this problem. 

2.3. Lower bound. Throughout this section, we specialize to the case that
$\mathcal{M}_q^n=\mathcal{M}_q$ does not depend on $n$ (this implies essentially that
$\mathcal{M}_q$ is compact). In this setting, Theorem 2.3 yields an upper bound of
order $\log\log n$ on the pathwise fluctuations of the likelihood ratio statistic.
The aim of this section is to obtain a matching lower bound of order $\log\log n$,
which shows that the minimal rate in Theorem 2.3 is essentially optimal. For the
purposes of a lower bound, the uniformity in $q$ is irrelevant, so that it suffices to
restrict attention to some fixed $q>q^*$. We will in fact obtain a much stronger result
in this case, which completely characterizes the precise pathwise asymptotics of the
likelihood ratio statistic for fixed $q$ in sufficiently smooth families.

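The geometric structure required in the present section is somewhat different
from that of Theorem 2.3. Instead of Hellinger balls, we consider the classes of
weighted densities $\mathcal{D}_q=\{d_f : f\in\mathcal{M}_q,\ f\ne f^*\}$ and
$\mathcal{D}=\bigcup_q\mathcal{D}_q$, where

\[ d_f = \frac{\sqrt{f/f^*}-1}{h(f,f^*)}. \]

In addition, we define for $\varepsilon>0$ and $q\ge 1$ the local weighted classes

\[ \mathcal{D}_q(\varepsilon) = \{ d_f : f\in\mathcal{M}_q,\ 0<h(f,f^*)\le\varepsilon \}, \qquad \bar{\mathcal{D}}_q = \bigcap_{\varepsilon>0} \operatorname{cl}\mathcal{D}_q(\varepsilon), \]

where the closure $\operatorname{cl}\mathcal{D}_q(\varepsilon)$ is in $L^2(f^*d\mu)$.
Evidently $\bar{\mathcal{D}}_q$ is the set of all possible limit points of $d_f$ as
$h(f,f^*)\to 0$ in $\mathcal{M}_q$. If the neighborhoods of $\bar{\mathcal{D}}_q$ are
sufficiently rich, such limits can be taken along a continuous path in the following
sense.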

DEFINITION 2.4. A point $d\in\bar{\mathcal{D}}_q$ is called continuously accessible if
there is a path $(f_t)_{t\in\,]0,1]}\subset\mathcal{M}_q\setminus\{f^*\}$ such that the
map $t\mapsto h(f_t,f^*)$ is continuous, $h(f_t,f^*)\to 0$ as $t\to 0$, and
$d_{f_t}\to d$ in $L^2(f^*d\mu)$ as $t\to 0$. The subset of all continuously accessible
points in $\bar{\mathcal{D}}_q$ will be denoted as $\bar{\mathcal{D}}_q^c$.

We can now formulate the main result of this section. 

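Theorem 2.5. Let $q^*\le p<q$. Assume that

\[ \int_0^1 \sqrt{\log\mathcal{N}(\mathcal{D}_q,u)}\,du < \infty, \]

and that $|d|\le D$ for all $d\in\mathcal{D}_q$ with $D\in L^{2+\alpha}(f^*d\mu)$ for
some $\alpha>0$. Then

\[ \limsup_{n\to\infty} \frac{1}{\log\log n} \Big\{ \sup_{f\in\mathcal{M}_q}\ell_n(f) - \sup_{f\in\mathcal{M}_p}\ell_n(f) \Big\} \ge \sup_{g\in L_0^2(f^*d\mu)} \Big\{ \sup_{f\in\bar{\mathcal{D}}_q^c} (\langle f,g\rangle)_+^2 - \sup_{f\in\bar{\mathcal{D}}_p} (\langle f,g\rangle)_+^2 \Big\} \quad \mathbf{P}^*\text{-a.s.,} \]

as well as

\[ \limsup_{n\to\infty} \frac{1}{\log\log n} \Big\{ \sup_{f\in\mathcal{M}_q}\ell_n(f) - \sup_{f\in\mathcal{M}_p}\ell_n(f) \Big\} \le \sup_{g\in L_0^2(f^*d\mu)} \Big\{ \sup_{f\in\bar{\mathcal{D}}_q} (\langle f,g\rangle)_+^2 - \sup_{f\in\bar{\mathcal{D}}_p^c} (\langle f,g\rangle)_+^2 \Big\} \quad \mathbf{P}^*\text{-a.s.,} \]

where $L_0^2(f^*d\mu)=\{g\in L^2(f^*d\mu) : \|g\|_2\le 1,\ \langle 1,g\rangle=0\}$.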

Only the first (lower bound) part of the theorem is needed to conclude optimality 
of the minimal log log n rate in Theorem 2.3. Indeed, we will obtain as a corollary 
the following lower bound counterpart to Theorem 2.3. 

COROLLARY 2.6. Suppose there exists $q>q^*$ such that the following hold.

1. There is an envelope function $D:E\to\mathbb{R}$ such that $|d|\le D$ for all
$d\in\mathcal{D}_q$ and $D\in L^{2+\alpha}(f^*d\mu)$ for some $\alpha>0$. Moreover,
$\int_0^1\sqrt{\log\mathcal{N}(\mathcal{D}_q,u)}\,du<\infty$.

2. $\bar{\mathcal{D}}_q^c\setminus\bar{\mathcal{D}}_{q^*}$ is nonempty.

Let $\eta(q)>0$ be an arbitrary positive function. Then

\[ \limsup_{n\to\infty} \frac{1}{\log\log n}\, \sup_{q\ge q^*} \frac{1}{\eta(q)} \Big\{ \sup_{f\in\mathcal{M}_q}\ell_n(f) - \sup_{f\in\mathcal{M}_{q^*}}\ell_n(f) \Big\} \ge C_0 \]

$\mathbf{P}^*$-a.s., where $C_0>0$ is nonrandom but may depend on $f^*$ and $\eta$.

The proofs of Theorem 2.5 and Corollary 2.6 are given in section 5.2 below. 

The fact that the geometric assumptions in Theorem 2.5 and Corollary 2.6 are
expressed in terms of weighted classes is not surprising, as the sharp asymptotic
expression provided by Theorem 2.5 for the pathwise fluctuations of the likelihood
ratio statistic is expressed in terms of a variational problem on the weighted
classes. Nonetheless, we are naturally led to ask whether there is any relation
between the geometric assumptions imposed in the upper bound Theorem 2.3 and the
lower bound Theorem 2.5, which appear to be quite different at first sight. In sec- 
tion 3, we will show that the global entropy of the weighted class is closely related 
to local entropy, so that the geometric assumptions for the upper and lower bounds 
are not too far apart. Beside the fundamental value of this observation, the relation 
between global and local entropies will prove to be an essential tool in order to 
verify these geometric assumptions in models with a complicated geometry, such 
as finite mixture models.

Remark 2.7. When $\bar{\mathcal{D}}_q$ and $\bar{\mathcal{D}}_p$ each contain an
$L^2(f^*d\mu)$-dense subset of continuously accessible points (which is typically the
case in sufficiently smooth models), then Theorem 2.5 provides the exact
characterization

\[ \limsup_{n\to\infty} \frac{1}{\log\log n} \Big\{ \sup_{f\in\mathcal{M}_q}\ell_n(f) - \sup_{f\in\mathcal{M}_p}\ell_n(f) \Big\} = \sup_{g\in L_0^2(f^*d\mu)} \Big\{ \sup_{f\in\bar{\mathcal{D}}_q} (\langle f,g\rangle)_+^2 - \sup_{f\in\bar{\mathcal{D}}_p} (\langle f,g\rangle)_+^2 \Big\} \quad \mathbf{P}^*\text{-a.s.} \]

Besides its intrinsic interest, this result has a surprising consequence. In the case
that $\mathcal{M}_q$ and $\mathcal{M}_p$ are regular parametric models with
$\dim(\mathcal{M}_q)>\dim(\mathcal{M}_p)$, one can choose $g\in\bar{\mathcal{D}}_q$
which is orthogonal to $\bar{\mathcal{D}}_p$. As
$\bar{\mathcal{D}}_q,\bar{\mathcal{D}}_p\subset L_0^2(f^*d\mu)$ (see the proof of
Corollary 2.6), it follows easily that in this case the right-hand side of the
previous equation display is precisely equal to 1. In particular, we obtain the
curious conclusion that in regular parametric models, the magnitude of the
fluctuations of the likelihood ratio statistic does not depend on the dimensions
$\dim(\mathcal{M}_q)$ and $\dim(\mathcal{M}_p)$. In contrast, it is well known that in
regular parametric models, the likelihood ratio statistic itself converges weakly to
a chi-square distribution with $\dim(\mathcal{M}_q)-\dim(\mathcal{M}_p)$ degrees of
freedom, so the tails of the distribution of the likelihood ratio statistic do in fact
depend strongly on the dimensions $\dim(\mathcal{M}_q)$ and $\dim(\mathcal{M}_p)$. Of
course, the dimension independence of the pathwise fluctuations will also cease to
hold if we are interested in a result that is uniform in the order $q$,
as in Theorem 2.3.
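
The dimension dependence of the weak limit mentioned in Remark 2.7 can be seen in a toy simulation. The sketch below rests on assumptions that are ours and not the paper's: the smaller model is the single density $N(0,I_d)$ (which is also the truth), the larger model is the unconstrained Gaussian location family $\{N(\theta,I_d):\theta\in\mathbb{R}^d\}$, and the compactness of section 2.3 is ignored. In this case the likelihood ratio statistic has the closed form $\|S_n\|^2/(2n)$ with $S_n=X_1+\cdots+X_n$, and twice the statistic is exactly chi-square with $d$ degrees of freedom.

```python
import numpy as np

def lr_statistic(samples):
    """sup_theta l_n(N(theta, I_d)) - l_n(N(0, I_d)) = ||S_n||^2 / (2 n),
    obtained by maximizing theta . S_n - n ||theta||^2 / 2 over theta."""
    n = samples.shape[0]
    s = samples.sum(axis=0)
    return float(s @ s) / (2.0 * n)

rng = np.random.default_rng(1)
n_obs, reps = 200, 4000          # hypothetical simulation sizes
mean_twice_lr = {}
for d in (1, 5):
    stats = [lr_statistic(rng.standard_normal((n_obs, d))) for _ in range(reps)]
    # the sample mean of 2 * LR should be close to d, the chi-square mean
    mean_twice_lr[d] = 2.0 * float(np.mean(stats))
```

The simulation only illustrates the distributional side of the remark. The pathwise statement (a limsup equal to 1 after normalization by $\log\log n$, whatever the dimension) is an almost sure asymptotic property and is not something a finite simulation of this kind can confirm.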

3. Entropy bounds. In section 2, we obtained pathwise bounds on the fluc- 
tuations of the likelihood ratio statistic in terms of the geometry of the underlying 
model classes. However, we have required two distinct types of geometric condi- 
tions: local entropy bounds for classes of densities, and global entropy bounds for 
classes of weighted densities. In this section, we will show that the latter implies 
the former under appropriate conditions, so that a suitable global entropy bound 
for weighted densities suffices for all the results in section 2. We will subsequently 
show how the requisite entropy bounds can be obtained for the case of location 
mixtures (cf. Example 2.1). The latter is significant both as an important applica- 
tion, and as a nontrivial case study in obtaining local entropy bounds in models 
with a complicated geometry. 

3.1. From global entropy to local entropy. We are going to establish that local 
entropy estimates for a class of densities M can be obtained from global entropy 
estimates on the associated weighted class D. To this end, let us consider for the 
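purposes of this section a general class of positive probability densities
$\mathcal{M}$ with respect to some reference measure $\mu$, a fixed
$f^*\in\mathcal{M}$, and define the class of weighted densities
$\mathcal{D}=\{d_f : f\in\mathcal{M},\ f\ne f^*\}$. In addition, we define for
$\delta>0$ the Hellinger ball
$\mathcal{H}(\delta)=\{\sqrt{f/f^*} : f\in\mathcal{M},\ h(f,f^*)\le\delta\}$. We obtain
the following result, whose proof is given in section 5.3.

Theorem 3.1. Suppose that there exist $q,C_0\ge 1$ and $\varepsilon_0>0$ such that

\[ \mathcal{N}(\mathcal{D},\varepsilon) \le \Big(\frac{C_0}{\varepsilon}\Big)^q \quad\text{for every } \varepsilon\le\varepsilon_0. \]

Let $R\ge\sup_f|d_f|$ be an envelope function such that $\|R\|_2<\infty$. Then

\[ \mathcal{N}(\mathcal{H}(\delta),\rho) \le \Big(\frac{C_1\,\delta}{\rho}\Big)^{q+1} \]

for all $\delta,\rho>0$ such that $\rho/\delta\le 4\wedge 2\|R\|_2$, where
$C_1=8C_0\,(1\vee\|R\|_2/4\varepsilon_0)$.

Of course, in the setting of section 2, we would apply this result to
$\mathcal{M}_q^n$, $\mathcal{D}_q^n$, $\mathcal{H}_q^n(\varepsilon)$ for given $n,q$
instead of to $\mathcal{M}$, $\mathcal{D}$, $\mathcal{H}(\varepsilon)$.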

3.2. The entropy of mixtures. We now develop the requisite entropy bounds in
the case of mixtures (Example 2.1). In this section, let $\mu$ be the Lebesgue
measure on $\mathbb{R}^d$. We fix a strictly positive probability density $f_0$ with
respect to $\mu$, and consider mixtures of densities in the class

\[ \{f_\theta : \theta\in\mathbb{R}^d\}, \qquad f_\theta(x)=f_0(x-\theta) \quad \forall x\in\mathbb{R}^d. \]

In everything that follows we fix a nondegenerate mixture $f^*$ of the form

\[ f^* = \sum_{i=1}^{q^*} \pi_i^* f_{\theta_i^*}. \]

Nondegenerate means that $\pi_i^*>0$ for all $i$, and $\theta_i^*\ne\theta_j^*$ for
all $i\ne j$.

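Let $\Theta\subset\mathbb{R}^d$ be a bounded parameter set such that
$\{\theta_i^* : i=1,\ldots,q^*\}\subseteq\Theta$, and denote its diameter by $2T$
(that is, $\Theta$ is included in some closed Euclidean ball of radius $T$). We
consider for $q\ge 1$ the family of $q$-mixtures

\[ \mathcal{M}_q = \Big\{ \sum_{i=1}^q \pi_i f_{\theta_i} : \pi_i\ge 0,\ \sum_{i=1}^q \pi_i = 1,\ \theta_i\in\Theta \Big\}, \]

and define the class of weighted densities as
$\mathcal{D}_q=\{d_f : f\in\mathcal{M}_q,\ f\ne f^*\}$. Let

\[ \begin{aligned}
H_0(x) &= \sup_{\theta\in\Theta} f_\theta(x)/f^*(x), \\
H_1(x) &= \sup_{\theta\in\Theta}\ \max_{i=1,\ldots,d} \big|\partial f_\theta(x)/\partial\theta^i\big|/f^*(x), \\
H_2(x) &= \sup_{\theta\in\Theta}\ \max_{i,j=1,\ldots,d} \big|\partial^2 f_\theta(x)/\partial\theta^i\partial\theta^j\big|/f^*(x), \\
H_3(x) &= \sup_{\theta\in\Theta}\ \max_{i,j,k=1,\ldots,d} \big|\partial^3 f_\theta(x)/\partial\theta^i\partial\theta^j\partial\theta^k\big|/f^*(x)
\end{aligned} \]

when $f_0$ is sufficiently differentiable, and let
$\mathcal{M}=\bigcup_{q\ge 1}\mathcal{M}_q$ and $\mathcal{D}=\bigcup_{q\ge 1}\mathcal{D}_q$.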

Remark 3.2. In the setting of Example 2.1, the parameter set $\Theta=\Theta(n)$
depends on $n$, and we then write $\mathcal{M}_q^n$ instead of $\mathcal{M}_q$, etc.
However, as the dependence on $n$ is irrelevant for the entropy computation, we
consider a fixed parameter set $\Theta$ in this section, and drop the dependence on
$n$ in our notation for simplicity.

We can now state the result of this section, whose proof is given in section 5.4. 

Assumption A. The following hold: 

1. $f_0\in C^3$ and $f_0(x)$, $(\partial f_0/\partial\theta^i)(x)$ vanish as $\|x\|\to\infty$.

2. $H_k\in L^4(f^*d\mu)$ for $k=0,1,2$ and $H_3\in L^2(f^*d\mu)$.

THEOREM 3.3. Suppose that Assumption A holds. Then there exist constants
$C^*$ and $\delta^*$, which depend on $d$, $q^*$ and $f^*$ but not on $\Theta$, $q$ or
$\delta$, such that

\[ \mathcal{N}(\mathcal{D}_q,\delta) \le \Big( \frac{C^*\,(T\vee 1)^{1/6}\,\big(\|H_0\|_4^4\vee\|H_1\|_4^4\vee\|H_2\|_4^4\vee\|H_3\|_2^2\big)}{\delta} \Big)^{18(d+1)q} \]

for all $q\ge q^*$, $\delta\le\delta^*$. Moreover, there is a function
$D\in L^4(f^*d\mu)$ with

\[ \|D\|_4 \le K^*\,\big(\|H_0\|_4\vee\|H_1\|_4\vee\|H_2\|_4\big), \]

where $K^*$ depends only on $d$ and $f^*$, such that $|d|\le D$ for all $d\in\mathcal{D}$.

Let us note that a key aspect of this result is that the dependence of the entropy
bound on the order $q$ and on the parameter set $\Theta$ is essentially explicit (see,
for example, Example 3.5 below). However, even for fixed $q$ and $\Theta$, the
existence of a polynomial bound on the bracketing number of $\mathcal{D}_q$ is far from
obvious (previous claims [16, 6, 1] that such bracketing numbers are polynomial were
stated without proof).

Define the Hellinger ball
$\mathcal{H}_q(\varepsilon)=\{\sqrt{f/f^*} : f\in\mathcal{M}_q,\ h(f,f^*)\le\varepsilon\}$.
Using Theorem 3.1, we immediately obtain the following result on the local entropy of
$\mathcal{M}_q$.

COROLLARY 3.4. Suppose that Assumption A holds. Then

\[ \mathcal{N}(\mathcal{H}_q(\varepsilon),\delta) \le \Big(\frac{C_\Theta\,\varepsilon}{\delta}\Big)^{18(d+1)q+1} \]

for all $q\ge q^*$ and $\delta/\varepsilon\le 1$, where

\[ C_\Theta = L^*\,(T\vee 1)^{1/6}\,\big(\|H_0\|_4^4\vee\|H_1\|_4^4\vee\|H_2\|_4^4\vee\|H_3\|_2^2\big)^{5/4} \]

and $L^*$ is a constant that depends only on $d$, $q^*$ and $f^*$.

Example 3.5 (Gaussian mixtures). Consider mixtures of standard Gaussian
densities $f_0(x)=(2\pi)^{-d/2}e^{-\|x\|^2/2}$, and let
$\Theta(T)=\{\theta\in\mathbb{R}^d : \|\theta\|\le T\}$. Fix a nondegenerate mixture
$f^*$, and define $T^*=\max_{i=1,\ldots,q^*}\|\theta_i^*\|$. Denote by
$\mathcal{H}_q(\varepsilon,T)$ the Hellinger ball associated to the parameter set
$\Theta(T)$. Then

\[ \mathcal{N}(\mathcal{H}_q(\varepsilon,T),\delta) \le \Big(\frac{C_1^*\,e^{C_2^*T^2}\,\varepsilon}{\delta}\Big)^{18(d+1)q+1} \]

for all $q\ge q^*$, $T\ge T^*$, and $\delta/\varepsilon\le 1$, where $C_1^*$, $C_2^*$
are constants that depend on $d$, $q^*$ and $f^*$ only. To prove this, it evidently
suffices to show that Assumption A holds and that $\|H_k\|_4$ for $k=0,1,2$ and
$\|H_3\|_2$ are of order $e^{CT^2}$. These facts are readily verified by a
straightforward but tedious computation.

Remark 3.6. We have not optimized the constants in Theorem 3.3 and Corollary 3.4.
In particular, the constant 18 in the exponent can likely be improved. On the other
hand, it is unclear whether the dependence on the diameter of $\Theta$ is optimal.
Indeed, if one is only interested in global entropy $\mathcal{N}(\mathcal{H}_q,\delta)$
where $\mathcal{H}_q=\{\sqrt{f/f^*} : f\in\mathcal{M}_q\}$, then it can be read off
from the proof of Theorem 3.3 that the constants in the entropy bound depend on
$\|H_0\|_1$ and $\|H_1\|_1$ only, which are easily seen to scale polynomially in $T$
due to the translation invariance of the Lebesgue measure. Therefore, for example in
the case of Gaussian mixtures, one can obtain a global entropy bound which scales only
polynomially as a function of $T$, whereas the above local entropy bound scales as
$e^{CT^2}$. The behavior of
local entropies is much more delicate than that of global entropies, however, and 
we do not know whether it is possible to obtain a local entropy bound that scales 
polynomially in T. 




The proof of Theorem 3.3 is long and rather technical. Nonetheless, there are 
some key ideas underlying the proof, which we aim to briefly explain here.

FIG 1. Let $f_\theta(x)=\sqrt{2/\pi}\,e^{-2(x-\theta)^2}$ and $f^*=f_{0.5}$, and
consider the mixture family
$\mathcal{M}_2=\{pf_{\theta_1}+(1-p)f_{\theta_2} : p,\theta_1,\theta_2\in[0,1]\}$.
The plots illustrate (a) the set of parameters $(p,\theta_1,\theta_2)$ corresponding
to the Hellinger ball $\{f\in\mathcal{M}_2 : h(f,f^*)\le 0.05\}$; and (b) the set of
parameters $\{(p,\theta_1,\theta_2) : N(p,\theta_1,\theta_2)\le 0.05\}$ with
$N(p,\theta_1,\theta_2)=|p(\theta_1-0.5)+(1-p)(\theta_2-0.5)|+\tfrac12p(\theta_1-0.5)^2+\tfrac12(1-p)(\theta_2-0.5)^2$.
The plots are related by the local geometry Theorem 5.11, which yields
$c^*N(p,\theta_1,\theta_2)\le h(pf_{\theta_1}+(1-p)f_{\theta_2},f^*)\le C^*N(p,\theta_1,\theta_2)$
for all $p,\theta_1,\theta_2\in[0,1]$.



The classical approach to controlling local entropies of a parametric class
$\mathcal{G}=\{g_\xi : \xi\in\Xi\}$ with $\Xi\subset\mathbb{R}^d$ is as follows (cf.
[26], Example 19.7). Suppose that the square root densities
$h_\xi=\sqrt{g_\xi/g_{\xi^*}}$ satisfy the pointwise Lipschitz condition

\[ |h_\xi(x)-h_{\xi'}(x)| \le H(x)\,|||\xi-\xi'|||, \qquad \xi,\xi'\in\Xi, \]

where $H$ is a function in $L^2$ and $|||\cdot|||$ is a norm on $\Xi$. Suppose,
moreover, that

\[ h(g_\xi,g_{\xi^*}) \ge c\,|||\xi-\xi^*|||, \qquad \xi\in\Xi. \]

Define $\Xi(\varepsilon)=\{\xi\in\Xi : |||\xi-\xi^*|||\le\varepsilon\}$ and
$\mathcal{H}(\varepsilon)=\{h_\xi : \xi\in\Xi,\ h(g_\xi,g_{\xi^*})\le\varepsilon\}$.
If $|||\xi-\xi'|||\le\delta$, then $h_{\xi'}-\delta H\le h_\xi\le h_{\xi'}+\delta H$.
Therefore, we can control the local bracketing entropy by
$\mathcal{N}(\mathcal{H}(c\varepsilon),2\delta\|H\|_2)\le N(\Xi(\varepsilon),\delta)$,
where $N(\Xi(\varepsilon),\delta)$ denotes metric entropy. But the metric entropy of a
ball can be controlled by a standard volume comparison argument, yielding
$N(\Xi(\varepsilon),\delta)\le((2\varepsilon+\delta)/\delta)^d$.
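
The volume comparison bound can be illustrated numerically. The sketch below is an illustration under assumptions of our own: the norm is taken to be Euclidean, the ball is replaced by a finite random sample of its points, and a greedy procedure (not taken from the text) selects a maximal $\delta$-separated subset of the sample. Balls of radius $\delta/2$ around the selected centers are disjoint and contained in a ball of radius $\varepsilon+\delta/2$, so their number cannot exceed $((2\varepsilon+\delta)/\delta)^d$; by maximality the centers also form a $\delta$-net of the sample.

```python
import numpy as np

def greedy_separated_subset(points, delta):
    """Greedily keep points whose Euclidean distance to all kept points exceeds delta."""
    centers = []
    for p in points:
        if all(np.linalg.norm(p - c) > delta for c in centers):
            centers.append(p)
    return np.array(centers)

def volume_bound(eps, delta, d):
    """The volume comparison bound ((2 eps + delta) / delta)^d."""
    return ((2.0 * eps + delta) / delta) ** d

rng = np.random.default_rng(2)
d, eps, delta = 2, 1.0, 0.25            # hypothetical dimension and radii
# uniform sample from the Euclidean ball of radius eps
directions = rng.standard_normal((2000, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
points = directions * eps * rng.random((2000, 1)) ** (1.0 / d)

centers = greedy_separated_subset(points, delta)
# distance from every sampled point to its nearest center
nearest = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2).min(axis=1)
```

Here `len(centers)` is at most `volume_bound(eps, delta, d)`, and every sampled point lies within `delta` of some center.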

Clearly the above properties require
$c\,|||\xi-\xi^*|||\le h(g_\xi,g_{\xi^*})\le\|H\|_2\,|||\xi-\xi^*|||$ for all
$\xi\in\Xi$. Therefore, such an approach can only work when the class $\mathcal{G}$
endowed with the Hellinger distance has a regular geometry (i.e., equivalent to a
subset of a finite dimensional Banach space), at least in a neighborhood of the true
parameter. This fails miserably in the case of mixture classes $\mathcal{M}_q$, which
possess a highly non-regular geometry in a neighborhood of $f^*$ when $q>q^*$. In
fact, it is easily seen that $h(f,f^*)=0$ does not even select a unique set of
parameters $(\pi_i,\theta_i)_{i=1,\ldots,q}$, as mixture models are non-identifiable,
and consequently the Hellinger balls $\mathcal{H}_q(\varepsilon)$ look nothing like
norm-balls when viewed as a subset of the parameters
$(\pi_i,\theta_i)_{i=1,\ldots,q}$ (cf. Figure 1). Thus we are faced with two basic
difficulties:



1. How does one control the subset of parameters $(\pi_i,\theta_i)_{i=1,\ldots,q}$
corresponding to the Hellinger balls $\mathcal{H}_q(\varepsilon)$?

2. How does one control the metric entropy of these sets?



The resolution of the first problem requires us to develop a precise understanding
of the local geometry of mixture classes, which is done in Theorem 5.11 below.
One key consequence of this result, for example, is as follows: one can choose
sufficiently small neighborhoods $A_1,\ldots,A_{q^*}$ of
$\theta_1^*,\ldots,\theta_{q^*}^*$, respectively, such that the Hellinger distance
$h(f,f^*)$ is bounded above and below up to a constant by the pseudodistance

\[ \sum_{j:\theta_j\in A_0}\pi_j + \sum_{i=1}^{q^*}\Big\{ \Big|\sum_{j:\theta_j\in A_i}\pi_j-\pi_i^*\Big| + \Big\|\sum_{j:\theta_j\in A_i}\pi_j(\theta_j-\theta_i^*)\Big\| + \frac12\sum_{j:\theta_j\in A_i}\pi_j\|\theta_j-\theta_i^*\|^2 \Big\} \]

(here $f=\sum_{j=1}^q\pi_jf_{\theta_j}$ and
$A_0=\mathbb{R}^d\setminus(A_1\cup\cdots\cup A_{q^*})$). This pseudodistance
quantifies precisely (and rather intuitively) the set of parameters with density close
to $f^*$, see Figure 1 for an illustration in the simplest possible case.
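
The comparison between the Hellinger distance and the pseudodistance can be explored numerically in the two-component family of Figure 1. The sketch below is a rough illustration, not the argument of Theorem 5.11. It assumes $f_\theta$ is the $N(\theta,1/4)$ density, $f^*=f_{0.5}$, and that the quadratic terms of the pseudodistance carry a factor $1/2$; the grid, the function names and the chosen parameter values are hypothetical.

```python
import numpy as np

x = np.linspace(-6.0, 7.0, 130001)   # hypothetical integration grid
dx = x[1] - x[0]

def f_theta(theta):
    """N(theta, 1/4) density, i.e. sqrt(2/pi) * exp(-2 (x - theta)^2), on the grid."""
    return np.sqrt(2.0 / np.pi) * np.exp(-2.0 * (x - theta) ** 2)

f_star = f_theta(0.5)

def hellinger_to_truth(p, t1, t2):
    """h(p f_{t1} + (1 - p) f_{t2}, f*) by a Riemann sum."""
    f = p * f_theta(t1) + (1.0 - p) * f_theta(t2)
    return np.sqrt(np.sum((np.sqrt(f) - np.sqrt(f_star)) ** 2) * dx)

def pseudodistance(p, t1, t2):
    """First-order term plus half the weighted squared deviations from 0.5."""
    first = abs(p * (t1 - 0.5) + (1.0 - p) * (t2 - 0.5))
    second = 0.5 * p * (t1 - 0.5) ** 2 + 0.5 * (1.0 - p) * (t2 - 0.5) ** 2
    return first + second

# a first-order perturbation and a symmetric (purely second-order) perturbation
ratio_first = hellinger_to_truth(1.0, 0.6, 0.5) / pseudodistance(1.0, 0.6, 0.5)
ratio_second = hellinger_to_truth(0.5, 0.4, 0.6) / pseudodistance(0.5, 0.4, 0.6)
# a non-identifiable representation of f* itself: both quantities vanish
h_degenerate = hellinger_to_truth(0.3, 0.5, 0.5)
```

In the symmetric case the first-order term cancels, so a distance of the form $|||\xi-\xi^*|||$ on the parameters would badly misjudge how close the mixture is to $f^*$, whereas the ratio to the pseudodistance stays of order one.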

As for the second problem, we avoid it entirely by exploiting Theorem 3.1 instead
of computing directly the local entropy. Using the local geometry Theorem 5.11 and
Taylor expansion, we can approximate the weighted densities $d_f$ by linear
combinations of their first and second derivatives with coefficients in a Euclidean
ball. The entropy of the latter is easily estimated by the Lipschitz argument
indicated above. However, the details are somewhat intricate: Taylor expansion should
only be applied to parameters $\theta_j$ that lie close to some $\theta_i^*$, which
requires careful bookkeeping.

The local geometry Theorem 5.11 and the relation between global entropy of 
weighted densities and local entropy developed in Theorem 3.1 are key ideas that
allow us to obtain local entropy estimates in a geometrically nontrivial model. Let 
us note that the restriction to location mixtures is only used in the proof of Theorem 
5.11. We believe that the same technique is applicable to other classes of mixtures 
(for example, Poisson mixtures or mixtures of densities in an exponential family) 
provided that the proof of Theorem 5.11 can be adapted to this setting. 

4. Strongly consistent order estimation. The goal of this section is to apply 
the results of sections 2 and 3 to identify what penalties and cutoffs yield strong 
consistency of penalized likelihood order estimators. We first develop some general
consistency and inconsistency results, and then consider specifically the problem
of mixture order estimation.

4.1. Consistency and minimal penalties. In this section we consider the general
setting introduced in section 2.1. We now suppose, however, that the true model
order $q^*$ (as well as the true density $f^*$) is not known, so that we must aim to
estimate $q^*$ from an observation sequence $(X_k)_{k\ge 1}$. To this end, define the
penalized likelihood order estimator as

\[ \hat q_n = \operatorname*{argmax}_{q\ge 1} \Big\{ \sup_{f\in\mathcal{M}_q^n} \ell_n(f) - \mathrm{pen}(n,q) \Big\}, \]

where $\mathrm{pen}(n,q)$ is a penalty function. Our goal is to show that the
penalized likelihood order estimator is strongly consistent, that is,
$\hat q_n\to q^*$ as $n\to\infty$ $\mathbf{P}^*$-a.s., for a suitable choice of the
penalty (that does not depend on $q^*$ or $f^*$). Let us emphasize that the maximum in
the definition of $\hat q_n$ is taken over all model orders $q\ge 1$, that is, we do
not assume that an a priori upper bound on the order is available, in contrast to most
previous work on this topic. We obtain the following general result.



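THEOREM 4.1. Suppose that for all $n$ sufficiently large, we have

\[ \mathcal{N}(\mathcal{H}_q^n(\varepsilon),\delta) \le \Big(\frac{K(n)\,\varepsilon}{\delta}\Big)^{\eta(q)} \]

for all $q\ge q^*$ and $\delta\le\varepsilon$, where $K(n)\ge 1$ and $\eta(q)\ge q$ are
increasing functions and we assume that $\log K(n)=o(n)$. Let $\mathrm{pen}(n,q)$ be a
penalty such that

\[ \lim_{n\to\infty}\,\sup_{q>q^*} \frac{\eta(q)\,\{\log K(2n)\vee\log\log n\}}{\mathrm{pen}(n,q)-\mathrm{pen}(n,q^*)} = 0, \qquad \lim_{n\to\infty}\,\max_{q\le q^*} \frac{\mathrm{pen}(n,q)}{n} = 0, \]

and $\mathrm{pen}(n,q)$ is increasing in $q$. Then $\hat q_n\to q^*$ as $n\to\infty$
$\mathbf{P}^*$-a.s.

Theorem 4.1 is proved in section 5.5 below.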

Let us now specialize to the case that $\mathcal{M}_q^n=\mathcal{M}_q$ does not depend
on $n$, as in section 2.3. In this case, Theorem 4.1 immediately yields the following
corollary.

COROLLARY 4.2. Suppose that for all $q\ge q^*$ and $\delta\le\varepsilon$

$$\mathcal{N}_{[]}(\mathcal{H}_q(\varepsilon),\delta)\le\Big(\frac{K\varepsilon}{\delta}\Big)^{\eta(q)},$$

where $K\ge 1$ and $\eta(q)\ge q$ is a strictly increasing function. Define the penalty

$$\mathrm{pen}(n,q)=\eta(q)\,\omega(n),$$

where $\omega(n)$ is any function such that

$$\lim_{n\to\infty}\frac{\log\log n}{\omega(n)}=\lim_{n\to\infty}\frac{\omega(n)}{n}=0.$$

Then $\hat q_n\to q^*$ as $n\to\infty$ $P^*$-a.s.
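
The reduction of Corollary 4.2 to Theorem 4.1 can be sketched as follows. This is only a reading aid, under the assumptions that $q$ ranges over the integers (so that $q>q^*$ means $q\ge q^*+1$), that $K(n)\equiv K$, and that $\omega(n)$ denotes the growth function in the penalty $\mathrm{pen}(n,q)=\eta(q)\,\omega(n)$.

```latex
\begin{align*}
% For q > q^*, since \eta is strictly increasing:
\frac{\eta(q)\,\{\log K\vee\log\log n\}}{\mathrm{pen}(n,q)-\mathrm{pen}(n,q^*)}
  &= \frac{\eta(q)}{\eta(q)-\eta(q^*)}\cdot\frac{\log K\vee\log\log n}{\omega(n)},\\
\frac{\eta(q)}{\eta(q)-\eta(q^*)}
  &= 1+\frac{\eta(q^*)}{\eta(q)-\eta(q^*)}
  \le 1+\frac{\eta(q^*)}{\eta(q^*+1)-\eta(q^*)},\\
% The bound above is uniform in q > q^*, so the supremum over q > q^* tends to zero
% as soon as \log\log n/\omega(n) \to 0 (which also forces \omega(n)\to\infty).
\max_{q<q^*}\frac{\mathrm{pen}(n,q)}{n}
  &\le \frac{\eta(q^*)\,\omega(n)}{n}\longrightarrow 0 .
\end{align*}
```

The penalty $\eta(q)\,\omega(n)$ is increasing in $q$ because $\eta$ is, and $\log K(n)=o(n)$ holds trivially for constant $K$, so all hypotheses of Theorem 4.1 on the penalty are met.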

Corollary 4.2 states that, when $\mathcal{M}_q^n=\mathcal{M}_q$ does not depend on $n$,
the penalized likelihood order estimator is strongly consistent provided the penalty
grows faster than $\log\log n$ and slower than $n$. Clearly the $\log\log n$ rate is the
minimal one attainable by applying Theorem 4.1. This raises the question whether the
$\log\log n$ rate is indeed minimal, in the sense that smaller penalties yield
inconsistent estimators. The following result (which follows easily from Theorem 2.5)
shows that this is indeed the case, so that the result of Corollary 4.2 is essentially
optimal.

COROLLARY 4.3. Suppose there exists $q>q^*$ such that the following hold.

1. There is an envelope function $D:E\to\mathbb{R}$ such that $|d|\le D$ for all
$d\in\mathcal{D}_q$ and $D\in L^{2+\alpha}(f^*d\mu)$ for some $\alpha>0$. Moreover,
$\int_0^1\sqrt{\log\mathcal{N}_{[]}(\mathcal{D}_q,u)}\,du<\infty$.

2. $\bar{\mathcal{D}}_q^c\setminus\bar{\mathcal{D}}_{q^*}$ is nonempty.

Let $\eta(q)>0$ be any strictly increasing function, and define the penalty

$$\mathrm{pen}(n,q)=C\,\eta(q)\log\log n.$$

If the constant $C>0$ is chosen sufficiently small, then $\hat q_n\ne q^*$ infinitely
often $P^*$-a.s.

The proof of Corollary 4.3 is given in section 5.5. Let us note that the proof of
Corollary 4.3 actually shows that
$\sup_{f\in\mathcal{M}_q}\ell_n(f)-\mathrm{pen}(n,q)>\sup_{f\in\mathcal{M}_{q^*}}\ell_n(f)-\mathrm{pen}(n,q^*)$
infinitely often $P^*$-a.s., so the conclusion of Corollary 4.3 is not altered
even if we were to impose a prior upper bound on the order.

In conclusion, we have shown that when $\mathcal{M}_q^n=\mathcal{M}_q$ does not depend
on $n$, penalties growing faster than $\log\log n$ are consistent while the penalty
$C\,\eta(q)\log\log n$ is inconsistent when the constant $C$ is sufficiently small. From
the proof of Theorem 4.1, we can also see that the penalty $C\,\eta(q)\log\log n$ is
consistent when $C$ is sufficiently large. However, the critical value of $C$ may depend
on the unknown parameter $f^*$, so that this minimal penalty may not be implementable.
On the other hand, assuming that $\eta(q)$ does not depend on $f^*$ (as is typically the
case), penalties satisfying the assumptions of Theorem 4.1 obviously do not depend on
the unknown parameter $f^*$ and therefore define admissible estimators. When
$\mathcal{M}_q^n$ depends on $n$, larger penalties may be required to ensure
consistency, depending on the growth rate of $K(n)$.
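
As an illustration of how the estimator of this section might be evaluated numerically, the following Python sketch selects the order from a list of maximized log-likelihoods. The function names, the truncation of the search at a finite maximal order (the estimator in the text maximizes over all orders), and the choices $\eta(q)=q$ with growth function $\log n$ are assumptions made for the example; computing the maximized log-likelihoods themselves is left out.

```python
import math
from typing import Callable, Sequence


def select_order(max_loglik: Sequence[float], n: int,
                 penalty: Callable[[int, int], float]) -> int:
    """Return the order q (1-based) maximizing max_loglik[q-1] - penalty(n, q).

    max_loglik[q-1] is assumed to hold the supremum of the log-likelihood over
    the order-q model, computed elsewhere. Only finitely many orders can be
    supplied, so the search is truncated at len(max_loglik).
    """
    scores = [ll - penalty(n, q) for q, ll in enumerate(max_loglik, start=1)]
    # max() returns the first maximizer, i.e. the smallest order among ties.
    return 1 + max(range(len(scores)), key=scores.__getitem__)


def log_penalty(n: int, q: int) -> float:
    """Hypothetical penalty q * log(n): grows faster than log log n, slower than n."""
    return q * math.log(n)


def small_loglog_penalty(n: int, q: int, c: float = 0.01) -> float:
    """Hypothetical penalty c * q * log log n with a small constant c."""
    return c * q * math.log(math.log(n))


# Made-up maximized log-likelihoods for orders q = 1, 2, 3, 4 and n = 1000.
example_loglik = [-1500.0, -1400.0, -1399.0, -1398.5]
order_log = select_order(example_loglik, 1000, log_penalty)
order_small = select_order(example_loglik, 1000, small_loglog_penalty)
```

On these made-up numbers the larger penalty picks order 2, while the small $\log\log n$ penalty lets the tiny likelihood gains of higher orders win; this mirrors, in a toy setting only, the distinction between consistent and too-small penalties discussed above.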

4.2. Mixture order estimation. We finally apply the results in the previous section
to mixture order estimation. Throughout this section, let $E=\mathbb{R}^d$ and let
$\mu$ be the Lebesgue measure on $\mathbb{R}^d$. Fix a strictly positive probability
density $f_0$ with respect to $\mu$, and define

$$\mathcal{M}_q^n=\Big\{\sum_{i=1}^q\pi_i f_{\theta_i}:\ \pi_i\ge 0,\ \sum_{i=1}^q\pi_i=1,\ \theta_i\in\Theta(n)\Big\},$$

where $f_\theta(x)=f_0(x-\theta)$ and
$\cdots\subset\Theta(n)\subset\Theta(n+1)\subset\cdots\subset\mathbb{R}^d$ is an
increasing family of bounded subsets of $\mathbb{R}^d$. We fix $f^*\in\mathcal{M}$
throughout this section. In the following, we consider two separate cases. The first
case is that of a compact parameter set, where $\Theta(n)=\Theta$ does not depend on
$n$. In this setting, we obtain a general result. Then, we consider the noncompact case
in the setting of Gaussian mixtures, and illustrate how Theorem 4.1 can be used to
obtain consistency results in this case.

Let us first consider the case of a compact parameter set. Then we obtain a
general consistency result under Assumption A (cf. section 3.2).

PROPOSITION 4.4. Suppose that the parameter set $\Theta(n)=\Theta$ is a bounded
subset of $\mathbb{R}^d$ independent of $n$, and that Assumption A holds. If we choose a
penalty of the form

$$\mathrm{pen}(n,q)=q\,\omega(n),\qquad\lim_{n\to\infty}\frac{\log\log n}{\omega(n)}=\lim_{n\to\infty}\frac{\omega(n)}{n}=0,$$

then $\hat q_n\to q^*$ as $n\to\infty$ $P^*$-a.s. On the other hand, if we choose the
penalty

$$\mathrm{pen}(n,q)=C\,q\log\log n$$

with a sufficiently small constant $C>0$, then $\hat q_n\ne q^*$ infinitely often
$P^*$-a.s.

We therefore find that in the setting of location mixtures with a compact parameter
set, the minimal penalty is of order $\log\log n$. Moreover, the popular BIC penalty

$$\mathrm{pen}(n,q)=\frac{(d+1)q-1}{2}\,\log n\tag{4.1}$$

yields a strongly consistent mixture order estimator in this setting, without a prior
upper bound on the order. The requisite Assumption A is a very mild one, which
highlights the broad applicability of this result. However, the assumption of a
compact parameter space can be quite restrictive in practice.

Let us therefore consider a case where the parameter space is noncompact.
For simplicity we restrict our attention to Gaussian mixtures, that is, we choose
$f_0(x)=(2\pi)^{-d/2}e^{-\|x\|^2/2}$, and we choose the restricted parameter sets
$\Theta(n)=\{\theta\in\mathbb{R}^d:\|\theta\|\le T(n)\}$ for some sequence
$T(n)\uparrow\infty$. Our aim is to choose the penalty $\mathrm{pen}(n,q)$ and cutoff
$T(n)$ so that the penalized likelihood order estimator is strongly consistent. In this
setting, we obtain the following result.

PROPOSITION 4.5. Let $f_0(x)=(2\pi)^{-d/2}e^{-\|x\|^2/2}$ and
$\Theta(n)=\{\theta\in\mathbb{R}^d:\|\theta\|\le T(n)\}$, and choose a penalty of the
form $\mathrm{pen}(n,q)=q\,\omega(n)$. If

$$\lim_{n\to\infty}\frac{\log\log n}{\omega(n)}=\lim_{n\to\infty}\frac{\omega(n)}{n}=0,\qquad T(n)=O(\sqrt{\log\log n}),$$

then $\hat q_n\to q^*$ as $n\to\infty$ $P^*$-a.s. On the other hand, the BIC penalty
(4.1) yields a strongly consistent order estimator if $T(n)=o(\sqrt{\log n})$.

This result illustrates that our theory can establish consistency of the penalized 
likelihood mixture order estimator without any prior upper bounds on the model 
order or the magnitude of the true parameters. Let us note that there is nothing 
particularly special about the Gaussian case: a similar result can be obtained, in 
principle, for any mixture distribution, as long as one can obtain suitable estimates 
on the quantities ||flj||4 that appear in Corollary 3.4 (see Example 3.5 for the Gaus- 
sian case). 

The proofs of Propositions 4.4 and 4.5 appear in section 5.6 below. 



5. Proofs. 



5.1. Proof of Theorem 2.3. The proof of Theorem 2.3 is based on the following
deviation bound for the log-likelihood ratio. This bound is essentially from [25],
Corollary 7.5, but the additional maximum inside the probability is essential for
our purposes.

THEOREM 5.1. Let $\mathcal{M}$ be a family of strictly positive probability densities
with respect to a reference measure $\mu$, fix some $f^*\in\mathcal{M}$, and define the
Hellinger ball
$\mathcal{H}(\varepsilon)=\{\sqrt{f/f^*}:f\in\mathcal{M},\ h(f,f^*)\le\varepsilon\}$,
where $h(f,g)^2=\int(\sqrt f-\sqrt g)^2\,d\mu$. Suppose that for some constants
$K\ge 1$, $p\ge 1$ and all $\delta\le\varepsilon$

$$\mathcal{N}_{[]}(\mathcal{H}(\varepsilon),\delta)\le\Big(\frac{K\varepsilon}{\delta}\Big)^{p},$$

where $\mathcal{N}_{[]}(\mathcal{H}(\varepsilon),\delta)$ is the minimal number of
brackets of $L^2(f^*d\mu)$-width $\delta$ needed to cover $\mathcal{H}(\varepsilon)$.
Let $(X_j)_{j\in\mathbb{N}}$ be i.i.d. with distribution $f^*d\mu$. Then

$$P^*\Big[\max_{n\le k\le 2n}\ \sup_{f\in\mathcal{M}}\{\ell_k(f)-\ell_k(f^*)\}\ge\alpha\Big]\le C\,e^{-\alpha/C}$$

for all $\alpha\ge Cp(1+\log K)$ and $n\ge 1$, where $C$ is a universal constant.

PROOF. Define $\bar f=(f+f^*)/2$ for any $f\in\mathcal{M}$, and define the empirical
process $\nu_n(g)=n^{-1/2}\sum_{k=1}^n\{g(X_k)-E^*[g(X_k)]\}$. Using concavity of
$\log x$ we have

$$\sum_{j=1}^k\log\Big(\frac{f(X_j)}{f^*(X_j)}\Big)\le 2k^{1/2}\,\nu_k(\log(\bar f/f^*))-2k\,D(f^*\|\bar f),$$

where $D(f^*\|f)=\int\log(f^*/f)\,f^*\,d\mu$ is relative entropy. As
$D(f^*\|\bar f)\ge h(\bar f,f^*)^2$



max 

n<k<2n 



* ( f(X 3 ) \ 

SUp 2^ log 777TTT > 

/£M^ \r{Xj)J 



a 



< p 



max S up{2fc 1 /^ fc (i og ( / / / *)) _ 2 M(/ 5 f*) 2 } > a 

n<k<2n Jg]y[ 



5 



max 

n<k<2n 



s=0 
S 

< 3 > max P 

s=0 



sup |fc 1/2 ^(log(/7/*))| > 02 s " 1 

/£M:nh(/,f ) 2 <a2 s 

sup |^(log({//n 1/2 ))| > <*2 s - 5 n 

/6M:/i(/,/*) 2 <o2 s n _1 



-1/2 



where S 1 = min{s : a2 s n 1 > 2}, and we have used Lemma 5.2 below for the 
last inequality. The remainder of the proof is identical to that of [25], Theorem 7.4 
provided we show that for 5£(e) = {y/f/f* : f € M, /*) < e} 



To this end, fix $\delta\le\varepsilon$, and note that $h(f,f^*)\le 4\,h(\bar f,f^*)$ by
[25], Lemma 4.2, so that
$\{f\in\mathcal{M}:h(\bar f,f^*)\le\varepsilon\}\subset\{f\in\mathcal{M}:h(f,f^*)\le 4\varepsilon\}$.
By assumption, there exist $N\le(2\sqrt2\,K\varepsilon/\delta)^p$ and functions
$g_1,\dots,g_N,h_1,\dots,h_N$ such that $\|h_i-g_i\|_2\le\sqrt2\,\delta$ for every $i$,
and for every $u\in\mathcal{H}(4\varepsilon)$ there is an $i$ such that
$g_i\le u\le h_i$. But for every $f\in\mathcal{M}$ such that
$h(\bar f,f^*)\le\varepsilon$, we then have for some $i$

$$2^{-1/2}\sqrt{g_i^2+1}\le\sqrt{\bar f/f^*}\le 2^{-1/2}\sqrt{h_i^2+1}.$$

Moreover, using $|\sqrt{a+c}-\sqrt{b+c}|\le|\sqrt a-\sqrt b|$ for $a,b,c\ge 0$ we obtain

$$\big\|2^{-1/2}\sqrt{h_i^2+1}-2^{-1/2}\sqrt{g_i^2+1}\big\|_2\le 2^{-1/2}\|h_i-g_i\|_2\le\delta.$$

The result now follows directly. 



□ 



The following variant of Etemadi's inequality was used in the proof. The proof
follows closely that of the classical Etemadi inequality, see [4], Appendix M19.

LEMMA 5.2. Let $\mathcal{Q}$ be a family of measurable functions $f:E\to\mathbb{R}$.
Then we have for every $\alpha>0$ and $m,n\in\mathbb{N}$, $m\le n$

$$P^*\Big[\max_{k=m,\dots,n}\ \sup_{f\in\mathcal{Q}}|S_k(f)|\ge 3\alpha\Big]\le 3\max_{k=m,\dots,n}P^*\Big[\sup_{f\in\mathcal{Q}}|S_k(f)|\ge\alpha\Big],$$

where $S_n(f)=n^{1/2}\,\nu_n(f)$.



PROOF. Define the stopping time
$\tau=\inf\{k\ge m:\sup_{f\in\mathcal{Q}}|S_k(f)|\ge 3\alpha\}$. Then

$$P^*\Big[\max_{k=m,\dots,n}\ \sup_{f\in\mathcal{Q}}|S_k(f)|\ge 3\alpha\Big]=P^*[\tau\le n]
\le P^*\Big[\sup_{f\in\mathcal{Q}}|S_n(f)|\ge\alpha\Big]+\sum_{k=m}^{n}P^*\Big[\tau=k\text{ and }\sup_{f\in\mathcal{Q}}|S_n(f)|<\alpha\Big].$$

But on the event $\{\tau=k\text{ and }\sup_{f\in\mathcal{Q}}|S_n(f)|<\alpha\}$, we
clearly have

$$2\alpha<\sup_{f\in\mathcal{Q}}|S_k(f)|-\sup_{f\in\mathcal{Q}}|S_n(f)|\le\sup_{f\in\mathcal{Q}}|S_k(f)-S_n(f)|.$$

Therefore, we can estimate

$$P^*\Big[\max_{k=m,\dots,n}\ \sup_{f\in\mathcal{Q}}|S_k(f)|\ge 3\alpha\Big]
\le P^*\Big[\sup_{f\in\mathcal{Q}}|S_n(f)|\ge\alpha\Big]+\sum_{k=m}^{n}P^*\Big[\tau=k\text{ and }\sup_{f\in\mathcal{Q}}|S_n(f)-S_k(f)|>2\alpha\Big]$$

$$\le P^*\Big[\sup_{f\in\mathcal{Q}}|S_n(f)|\ge\alpha\Big]+\max_{k=m,\dots,n}P^*\Big[\sup_{f\in\mathcal{Q}}|S_n(f)-S_k(f)|>2\alpha\Big],$$

where we have used that $\sup_{f\in\mathcal{Q}}|S_n(f)-S_k(f)|$ and $\{\tau=k\}$ are
independent to obtain the last inequality. The proof is now easily completed. □

We can now complete the proof of Theorem 2.3. 

Proof of Theorem 2.3. By assumption, we have /* e for all q > q* 
when n is sufficiently large. Then by Theorem 5.1, we have for n sufficiently large 



max sup {4(/) - 4(/*)} > a 



n<k<2n j gM 2 



for all a > Cr)(q)(l + logif(2n)) and g > q*. Using that C JA 2 q n for n < 
k <2n and 4(/*) < su P/gM\ ^k(f), we have for n sufficiently large 



max sup -7-r< sup 4(/) - sup 4(/) > > a 
n<k<2n q > q * rj(q) I /gM k /eM \ I 



9=9* 



for all a > C(l + log if (2n)). Let /3(n) be an increasing function. Then 



111 1 2C 

2"<fc<2«+i /3(fc) g > 9 * r/(g) I /eM fc /eM fc 1 /?' 



for all n sufficiently large, provided that (3(2 n ) > logiT(2 n+1 ) V log log 2 n . The 
proof is now easily completed using the Borel-Cantelli lemma. □ 

5.2. Proof of Theorem 2.5. The proof of Theorem 2.5 is based on a sequence
of auxiliary results. First, we will need a compact law of iterated logarithm for the
Strassen functional

$$I_n(g)=\frac{1}{\sqrt{2n\log\log n}}\sum_{i=1}^{n}\{g(X_i)-E^*(g(X_i))\}.$$

We state the requisite result for future reference. 

THEOREM 5.3. Let $\mathcal{Q}$ be a family of measurable functions $f:E\to\mathbb{R}$
with

$$\int_0^1\sqrt{\log\mathcal{N}_{[]}(\mathcal{Q},u)}\,du<\infty.$$

Then, $P^*$-a.s., the sequence $(I_n)_{n\ge 0}$ is relatively compact in
$\ell_\infty(\mathcal{Q})$, and its set of cluster points coincides precisely with the
set $\mathcal{K}=\{f\mapsto\langle f,g\rangle:g\in L^2_0(f^*d\mu)\}$. 

Proofs of this result can be found in [23], Theorem 4.2 or in [17], Theorem 9. 
We will also need the following simple result, whose proof is omitted. 
LEMMA 5.4. Let $(X_j)_{j\ge 1}$ be an i.i.d. sequence of random variables, and suppose
$E[|X_1|^p]<\infty$. Then $n^{-1/p}\max_{j=1,\dots,n}|X_j|\to 0$ a.s. as $n\to\infty$.

Finally, we will need the following likelihood inequality that relates the
log-likelihood ratio $\ell_n(f)-\ell_n(f^*)$ to the empirical process. Related
inequalities appear in [11, 18, 6], but the following form is perhaps the most natural. 

LEMMA 5.5. For any strictly positive probability density $f\ne f^*$ we have

$$\ell_n(f)-\ell_n(f^*)\le|(\nu_n(d_f))_+|^2,$$

where $\nu_n(g)=n^{-1/2}\sum_{k=1}^n\{g(X_k)-E^*[g(X_k)]\}$ denotes the empirical
process.

PROOF. Note that

$$h(f,f^*)^2=2-2\int\sqrt{f f^*}\,d\mu=-2\,h(f,f^*)\,E^*(d_f(X_1)).$$

Using $\log(1+x)\le x$, we can estimate

$$\ell_n(f)-\ell_n(f^*)=\sum_{i=1}^{n}2\log(1+h(f,f^*)\,d_f(X_i))\le\sum_{i=1}^{n}2\,h(f,f^*)\,d_f(X_i)$$

$$=2\,\nu_n(d_f)\,h(f,f^*)\sqrt n-h(f,f^*)^2\,n\le\sup_{p\ge 0}\{2\,\nu_n(d_f)\,p-p^2\}.$$

The proof is easily completed. □ 
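
The final step of this proof rests on an elementary maximization, spelled out here as a sketch. The symbol $a$ is shorthand introduced for this illustration (standing for the value of the empirical process at $d_f$) and is not notation from the text.

```latex
\[
\sup_{p\ge 0}\{2ap-p^2\}
  = \sup_{p\ge 0}\{a^2-(p-a)^2\}
  = \begin{cases}
      a^2, & a\ge 0 \quad (\text{attained at } p=a),\\
      0,   & a<0  \quad (\text{attained at } p=0),
    \end{cases}
\]
% In both cases the supremum equals (a_+)^2, where a_+ = \max(a,0).
```

Taking $a$ to be the empirical process evaluated at $d_f$ therefore bounds the log-likelihood ratio by the squared positive part, which is the form used in the lemma.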

We can now obtain the following asymptotic expansion of the log-likelihood, 
which provides a pathwise counterpart to the weak convergence theory in [11, 18]. 

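PROPOSITION 5.6. Let $q\ge q^*$. Assume that

$$\int_0^1\sqrt{\log\mathcal{N}_{[]}(\mathcal{D}_q,u)}\,du<\infty,$$

and that $|d|\le D$ for all $d\in\mathcal{D}_q$ with $D\in L^{2+\alpha}(f^*d\mu)$ for
some $\alpha>0$. Then

$$\sup_{f\in\mathcal{M}_q(4\sqrt{\log\log n/n})}\Big\{2\,I_n(d_f)\,h(f,f^*)\sqrt{\frac{2n}{\log\log n}}-h(f,f^*)^2\,\frac{2n}{\log\log n}\Big\}
-\frac{1}{\log\log n}\Big\{\sup_{f\in\mathcal{M}_q}\ell_n(f)-\ell_n(f^*)\Big\}\xrightarrow{n\to\infty}0\quad P^*\text{-a.s.},$$

where we have defined $\mathcal{M}_q(\varepsilon)=\{f\in\mathcal{M}_q:h(f,f^*)\le\varepsilon\}$. 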
PROOF. We proceed in several steps.

Step 1 (localization). As $q\ge q^*$ (hence $f^*\in\mathcal{M}_q$), clearly

$$\sup_{f\in\mathcal{M}_q}\ell_n(f)-\ell_n(f^*)=\sup_{f\in\mathcal{M}_q:\,\ell_n(f)-\ell_n(f^*)\ge 0}\{\ell_n(f)-\ell_n(f^*)\}.$$

Now note that, as in the proof of Lemma 5.5,

$$\ell_n(f)-\ell_n(f^*)\le 2\sqrt n\,\nu_n(d_f)\,h(f,f^*)-n\,h(f,f^*)^2.$$

Therefore, we can estimate

$$\sup_{f\in\mathcal{M}_q:\,\ell_n(f)-\ell_n(f^*)\ge 0}h(f,f^*)
\le\sup_{f\in\mathcal{M}_q:\,\ell_n(f)-\ell_n(f^*)\ge 0}\Big\{h(f,f^*)+\frac{\ell_n(f)-\ell_n(f^*)}{n\,h(f,f^*)}\Big\}$$

$$\le\frac{2}{\sqrt n}\sup_{f\in\mathcal{M}_q:\,\ell_n(f)-\ell_n(f^*)\ge 0}\nu_n(d_f)
\le\sqrt{\frac{8\log\log n}{n}}\ \sup_{d\in\mathcal{D}_q}I_n(d).$$

Now note that we can estimate

$$\sup_{d\in\mathcal{D}_q}I_n(d)\le\inf_{g\in L^2_0(f^*d\mu)}\ \sup_{d\in\mathcal{D}_q}|I_n(d)-\langle d,g\rangle|+\sup_{g\in L^2_0(f^*d\mu)}\ \sup_{d\in\mathcal{D}_q}\langle d,g\rangle.$$

The first term on the right converges to zero $P^*$-a.s. as $n\to\infty$ by Theorem 5.3,
while the second term is easily seen to equal
$\sup_{d\in\mathcal{D}_q}\|d-\langle 1,d\rangle\|_2\le 1$. Therefore

$$\sup_{f\in\mathcal{M}_q:\,\ell_n(f)-\ell_n(f^*)\ge 0}h(f,f^*)\le(1+\varepsilon)\sqrt{\frac{8\log\log n}{n}}$$

eventually as $n\to\infty$ $P^*$-a.s. for any $\varepsilon>0$. In particular, we find
that

$$\{f\in\mathcal{M}_q:\ell_n(f)-\ell_n(f^*)\ge 0\}\subset\{f\in\mathcal{M}_q:h(f,f^*)\le 4\sqrt{\log\log n/n}\}$$

eventually as $n\to\infty$ $P^*$-a.s. This implies that $P^*$-a.s. eventually as
$n\to\infty$

$$\sup_{f\in\mathcal{M}_q}\ell_n(f)-\ell_n(f^*)\le\sup_{f\in\mathcal{M}_q:\,h(f,f^*)\le 4\sqrt{\log\log n/n}}\{\ell_n(f)-\ell_n(f^*)\}.$$

But the reverse inequality clearly holds for all $n$, so that in fact

$$\sup_{f\in\mathcal{M}_q}\ell_n(f)-\ell_n(f^*)=\sup_{f\in\mathcal{M}_q(4\sqrt{\log\log n/n})}\{\ell_n(f)-\ell_n(f^*)\}$$

eventually as $n\to\infty$ $P^*$-a.s. 

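Step 2 (Taylor expansion). Taylor expansion gives $2\log(1+x)=2x-x^2+x^2R(x)$, where
$R(x)\to 0$ as $x\to 0$. Thus we can write, for any $f\in\mathcal{M}_q$,

$$\ell_n(f)-\ell_n(f^*)=\sum_{i=1}^{n}2\log(1+h(f,f^*)\,d_f(X_i))$$

$$=2\,h(f,f^*)\sum_{i=1}^{n}d_f(X_i)-h(f,f^*)^2\sum_{i=1}^{n}\{(d_f(X_i))^2-1\}-n\,h(f,f^*)^2+h(f,f^*)^2\sum_{i=1}^{n}(d_f(X_i))^2\,R(h(f,f^*)\,d_f(X_i)).$$

Using that $E^*(d_f(X_1))=-h(f,f^*)/2$, we therefore have

$$\frac{1}{\log\log n}\{\ell_n(f)-\ell_n(f^*)\}=2\,I_n(d_f)\,h(f,f^*)\sqrt{\frac{2n}{\log\log n}}-h(f,f^*)^2\,\frac{2n}{\log\log n}+\frac{n\,h(f,f^*)^2}{\log\log n}\,R_{f,n},$$

where we have defined

$$R_{f,n}=-\frac1n\sum_{i=1}^{n}\{(d_f(X_i))^2-1\}+\frac1n\sum_{i=1}^{n}(d_f(X_i))^2\,R(h(f,f^*)\,d_f(X_i)).$$

It follows easily that

$$\Big|\frac{1}{\log\log n}\Big\{\sup_{f\in\mathcal{M}_q}\ell_n(f)-\ell_n(f^*)\Big\}-\sup_{f\in\mathcal{M}_q(4\sqrt{\log\log n/n})}\Big\{2\,I_n(d_f)\,h(f,f^*)\sqrt{\frac{2n}{\log\log n}}-h(f,f^*)^2\,\frac{2n}{\log\log n}\Big\}\Big|
\le 16\sup_{f\in\mathcal{M}_q(4\sqrt{\log\log n/n})}|R_{f,n}|$$

eventually as $n\to\infty$ $P^*$-a.s. 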

Step 3 (end of proof). We can easily estimate

$$\sup_{f\in\mathcal{M}_q(4\sqrt{\log\log n/n})}|R_{f,n}|\le\sup_{f\in\mathcal{M}_q}\Big|\frac1n\sum_{i=1}^{n}\{(d_f(X_i))^2-1\}\Big|
+\sup_{|x|\le 4\sqrt{\log\log n/n}\,\max_{i=1,\dots,n}D(X_i)}|R(x)|\ \frac1n\sum_{i=1}^{n}(D(X_i))^2.$$

As $\mathcal{N}_{[]}(\mathcal{D}_q,\delta)<\infty$ for every $\delta>0$, the class
$\{d^2:d\in\mathcal{D}_q\}$ can be covered by a finite number of brackets with
arbitrary small $L^1(f^*d\mu)$-norm and is therefore $P^*$-Glivenko-Cantelli. Moreover,
by construction $E^*[(d_f(X_i))^2]=1$ for all $f\in\mathcal{M}_q$. Therefore, the first
term in this expression converges to zero as $n\to\infty$ $P^*$-a.s. On the other hand,
by Lemma 5.4 and the fact that $D\in L^{2+\alpha}(f^*d\mu)$, we have $P^*$-a.s.

$$\sqrt{\frac{\log\log n}{n}}\ \max_{i=1,\dots,n}D(X_i)=\sqrt{\frac{\log\log n}{n^{\alpha/(2+\alpha)}}}\ n^{-1/(2+\alpha)}\max_{i=1,\dots,n}D(X_i)\xrightarrow{n\to\infty}0.$$

Therefore the second term converges to zero also, and the proof is complete. □ 
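PROPOSITION 5.7. Let $q\ge q^*$. Assume that

$$\int_0^1\sqrt{\log\mathcal{N}_{[]}(\mathcal{D}_q,u)}\,du<\infty,$$

and that $|d|\le D$ for all $d\in\mathcal{D}_q$ with $D\in L^{2+\alpha}(f^*d\mu)$ for
some $\alpha>0$. Then

$$\liminf_{n\to\infty}\Big\{\sup_{d\in\bar{\mathcal{D}}_q}(I_n(d))_+^2-\frac{1}{\log\log n}\Big\{\sup_{f\in\mathcal{M}_q}\ell_n(f)-\ell_n(f^*)\Big\}\Big\}\ge 0\quad P^*\text{-a.s.}$$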



PROOF. By Proposition 5.6, we have 



1 



liminf { sup (I n (d)) 2 + - — I sup £ n (f) - £ n (f 

n ^°° \dei) q log log n [ feMq 



> liminf < sup (I n {d)) 2 + — sup sup {2 I n (df) p — p 2 } 

= S 9 /eM„(4^/log log n/n) 



liminf < 

n— >oo 



sup (I n (d))+ - sup (I n (df)) r t 

rfGD, /eM 9 (4 A /log log n/n) 



Suppose that the right hand side is negative with positive probability. Then there is 
an e > and a sequence r n f oo of random times such that 

(5.1) sup (I Tn {d))l ~ sup {Ir n {df))l < -e for all n 

d ^9 /eM,(4^/log log r n /r n ) 

with positive probability. We will show that this entails a contradiction. 

By Theorem 5.3 (which can be applied here as 3sf(D g , 5) = N(cl T) q , 5) for all 
5 > 0), the process (I Tn )n>o is P*-a.s. relatively compact in £oo(cl D g ) with 

(5.2) inf sup \I Tn (d) - (d,g)\ P*-a.s. 
g£Ll(f*du) dec\V q 
Then there is a set of positive probability on which (5.1) and (5.2) hold
simultaneously. We now concentrate our attention on a single sample path in this set. 
For any such path, we can clearly find a further subsequence a n f oo such that 

su PdGciD \I(T n (d) — (d,g}\ — > as n — > oo for some g G L^f^dfi). Therefore 



sup \(I an (d))l-({d,g))l\< sup \I an (d)-(d,g}\' 



declD„ 



deciv„ 



+ 2 sup \I an (d) - (d,g)\ sup \{d,g}\ 



+ 0, 



where we have used the elementary estimate |a+ — 6+| = |a+ — 6+|(a+ + 6+) < 
\a + - b + \(\a + - 6+| + 26+) < \a - b\(\a - b\ + 2|6|) for any a, 6 6 R, and the 
fact that sup dgclI) |(d,c)| < sup dgclI) | j ci| 1 2 1 1^ 1 1 2 < 1. Thus (5.1) gives 



liminf < 

n— »oo 



sup((d,#))+ - sup {{d f ,g))t 

/SM^^loglogan/CTn) 



liminf < 

ra— >oo 



sup (J CT „(d))+ - sup (J„„(d/))f 

/eM q (4^/loglog CTn / t r n ) 



< -e. 



But as <i i— ^ (d, g) is continuous in L 2 (f*dfi) and cl D 9 (4-y/log log cr n /a n ) is com- 
pact in L 2 (f*d/j,) (which follows from N(Dg, 6) < oo for all 5 > 0), we have 



sup 



/ GMq (4-y/loglogCTn/o-n) 



sup 



((d,g)) 



2 n— >oo 



decl (4-^log log a„/<r n ) 

sup ((d,#))+ = sup((d,#})+. 

rf en„>o cl ^^(^log log a„/a n ) 



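Thus we have a contradiction, completing the proof. □

We now obtain a converse to the previous result. 

PROPOSITION 5.8. Let $q\ge q^*$. Assume that

$$\int_0^1\sqrt{\log\mathcal{N}_{[]}(\mathcal{D}_q,u)}\,du<\infty,$$

and that $|d|\le D$ for all $d\in\mathcal{D}_q$ with $D\in L^{2+\alpha}(f^*d\mu)$ for
some $\alpha>0$. Then

$$\limsup_{n\to\infty}\Big\{\sup_{d\in\bar{\mathcal{D}}_q^c}(I_n(d))_+^2-\frac{1}{\log\log n}\Big\{\sup_{f\in\mathcal{M}_q}\ell_n(f)-\ell_n(f^*)\Big\}\Big\}\le 0\quad P^*\text{-a.s.}$$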
PROOF. Suppose that the result does not hold true. By Proposition 5.6, there is 
an $\varepsilon>0$ and a sequence $\tau_n\uparrow\infty$ of random times such that 

S up(/ r „(d)) 2 + - sup \-h(f,n 2 2r ' 



^ /6M 9 (Vloglogr„/r„) I K>g log T„ 



t!, '-(''>*n/i^;p fora11 " 

with positive probability. Proceeding as in the proof of Proposition 5.7, we can then 
show that there is a sequence of times a n f oo and some g G L/^{f*dn) such that 

limsupi sup((d,#})+- sup i -h(f,f*) 2 - — — 

n ^°° / G M 9 (4Viogio gCT „/ CT „) I log log a n 



+ 2(d f ,g)h(f,nJ 



2a n 



log log <7 n 



> e. 



We will show that this entails a contradiction. 

Let do G i> g be a continuously accessible point. Then there exists an ao > 

(depending on do) and a path (/a)ae]o,aol sucn tnat ^(/a>/*) = Q f° r an a e 
]0, ao] and — >■ do in L 2 (f*dfi) as a -> 0. Now choose the sequence 



«« = {((do,5))+ +^n 1 }^/ 



log log a n 



2a n 

As((do)fi l ))+ < H^olhlMh < 1, we clearly have 



< a n < a A 4y / log \oga n /a n 



for all n sufficiently large. In particular f an S M g (4y / log log a n /a n ), so that 



s_up \2(d f ,g)h(f,r)J ^ -h(f,n 

/ g M g (4V io g kW^ ) I Vlogloga n 



2<7 n 



log log CT n 

> 2 (d /Qn , 5 ) {((d , + a- 1 } - {((d ,g)) + + a' 1 } 2 . 

Therefore, we have 

2a v. 



limsup< sup((d,5)) + - sup < ~h(f,f 

n^co [ deV c /eM,(Vlogloga„/ CT „) I 



*\2 



log log 0>, 



+ 2 (d /; g) h(f, /*) J . 1 I < sup ((d, g))\ - «d , <?)) 

V loglogcr n , £ C 
for any continuously accessible element $d_0\in\bar{\mathcal{D}}_q^c$. But clearly we
can choose $d_0$ to make the right hand side of this expression arbitrarily small. Thus
we have the desired contradiction, completing the proof. □ 

We can now complete the proof of Theorem 2.5. 

Proof of Theorem 2.5. We obtain separately the lower and upper bounds. 
Lower bound. By Propositions 5.7 and 5.8, we have 

limsup- — \ sup £ n (f) - sup 4(/) \ > 

n^oo log log n [ /eM , /eM p J 

limsup < sup (I n (d)) 2 + - sup (I n (d)) 2 + \ P*-a.s. 

Now fix any g G L^(f*dp). By Theorem 5.3 (which applies here as 3sf(D g , 5) = 
N(clD g ,<5) > J{(T> q ,5) for all 5 > 0), there is a sequence r n f °° of random 
times such that I Tn — > ( • , g) in £ 00 ('T) q ) P*-a.s. Therefore 

sup (J Tn (d))+ - sup (J Tn (d))5- sup ((d,g))+ - sup ((d,g)) 2 + P*-a.s., 

so that certainly 

limsup- — r I sup £ n (f) - sup £ n (/) I > sup ((d, g))\ - sup ((d, g}) 2 + 

„->oo log log n y & M q feM p J d&b% dei> p 

P*-a.s. But as this inequality holds for every g G L^{f*d^L), taking the supremum 
over g gives the requisite lower bound. 

Upper bound. By Propositions 5.7 and 5.8, we have 

limsup- — r < sup £ n (f) - sup £ n (f) > < 

n^oo log log n y f( zJA q feM p J 

limsup < sup (I n {d)) 2 + — sup (I n {d)) 2 + > P*-a.s. 
n ^°° [deVq J 

It is elementary that for any d, d' G T> q and g G L 2 ) (f*d[i) 

(I n (d)) 2 + - (I n (d')) 2 + 

< \(I n (d)) 2 + - ((d,g)) 2 + \ + \{I n (d'))\ - ((d',g)) 2 + \ + ((d,g))l - ((d',g)) 2 + 

< 2 sup \{I n {d))\ - ((d, g))l\ + ({d,g))l - ((d',g)) 2 + . 

dev a 
Taking the supremum over $d\in\bar{\mathcal{D}}_q$ and the infimum over $d'\in\bar{\mathcal{D}}_p^c$, we find that 

SUp (i"n(cZ))+ - SUP Un( d )) + 

<2 sup \(I n (d))l-((d,g)) 2 + \+ sup({d,g)) 2 + - sap ({d, g))% 
dev q dev q dev° 

<2 sup \(I n (d)) 2 + -((d,g))l\ 

d&D q 

+ sup < swp({d,g))\ - sup((d,flf))+ \. 
geLl(f*dn) [dev q deD- J 

But as this holds for any g € Lq(/*^), we finally obtain 

sup (J„(d))+ " sup {I n (d))\ < 2 inf sup \(I n (d))\ - ((d,g}) 2 + \ 

deD q deD^ 9£L Q {f*dn) d£T)q 

+ sup I sup ({d,g)) 2 + - sup ((d,g})%\ . 
g eLltf*dii) [dei> q dei)c ) 

It follows as in the proof of Proposition 5.7 that the first term in this expression 
converges to zero P*-a.s. The requisite upper bound follows immediately. □ 

Finally, we now complete the proof of Corollary 2.6 

Proof of Corollary 2.6. It evidently suffices to prove that 

(5.3) T := sup \ sup((d,g))l - sup ((d,g}) 2 + \ > 0. 

g eL%{f*dii) { dei>- dexy J 

To this end, note that by direct computation 

<M/) = nm = — 2— 

Choose (/„)„>o C M,\{/*} such that h(f n , /*) and d /n ->do£ 2),, then 
(1,4)= lim(l,d /n ) = - lim Kfn ' r) =0. 

Moreover, it is immediate that 1 1 1 1 2 < 1- We have therefore shown that T) q C 
LQ(f*dfj,). Now choose g G CD^I^*. As !Dq* is closed, it follows directly that 

sup((d,fi>))+ = 1, sup < 1- 

Therefore (5.3) holds, and the proof is complete. □ 
5.3. Proof of Theorem 3.1. 

Proof of Theorem 3.1 . The assumption implies that 

/ C \ 9 
N(D, e) < ( J for every e > 0. 

If e< ||-R|| 2 /4, then 

Defining C = C (l V ||i?|| 2 /4e ), we find that 

f C\ 9 

3Sf(D, e) < - for every e < ||i2|| 2 /4. 



The remainder of the proof is devoted to establishing that 

'8C5\ q+1 



P 



for all 5, p > such that p/5 < 4 A 2||i?||2, which is the desired result. 

Fix e, 5 > and let N = N(D, e). Then there exist h,u±, . . . , In,un such that 
ll^j — Z-i 1 1 2 < £ for all i and for every /, there is an i such that k < df < Ui. Choose 
/ such that r- n 5 < h{f, /*) < r - n+1 5 (with r > 1). Then there is an i so that 



(r- n k A r-™ +1 4) 5 + 1 < 7/77^ < (r-"ni V r"^ 1 ^) 5 + 1. 
Note that 

\\ Ui r- n 5 -hr- n 5\\ 2 < r~ n 5e, 
\\ Ui r- n+1 5 - k r- n+1 5|| 2 < r" n+1 fe, 

Wm r' n+1 5 - k r- n 5\\ 2 < (r - l)r~ n 5 + r~ n+1 5e, 
\\ Ui r~ n 5 - k r~ n+1 5|| 2 < (r - l)r~ n 5 + r" n+1 fe, 

where the latter two estimates follow from li < df < u-i, \\dfW2 = 1, and 

{m - k) r~ n 5 < Ui r~ n+1 5 - k r~ n 5 - d f (r - l)r~ n 5 < (m - k) r~ n+1 5, 
{m - k) r' n b < Ui r~ n 5 - k r~ n+1 5 + d f (r - l)r~ n 5 < (m - k) r - n+1 5. 

As I a V 6 — c A d\ < |a — c| + |a — d\ + \b — c\ + \b — d|, we can estimate 

\\{r- n Ui V r- n+1 Ui) 5 - {r~ n l t A r- n+1 l t ) 5\\ 2 < 2(r - l)r~ n 5 + Ar' n+1 5e. 
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Therefore, we have shown that 

^({VfU* ■ r~ n 6 < h(f,f*) < r- n+1 5},2{r-l)r- n 5+Ar~ n+l 8e) < N(D,e) 
for arbitrary e, <5 > 0, r > 1, n € N. In particular, 

m^fW ■■ r~ n S < h(f,n < r- n+l 5},p) < N(D, {r^p/S - ±(1 - 1/r)) 

for every 6 > 0, r > 1, n G N, p > 2(r - l)r _n <5. 

Note that, by finiteness of the bracketing entropies, we can choose an envelope 
function R > supy \df\ such that ||i?||2 < oo. Then we evidently have 



1 - r~ n 5R < y/f/f* < 1 + r~ n 8R 
whenever h(f, /*) < r~ n 5. Therefore 

ni^fW* ■■ Kf,n < r-W5},2r- H 5\\R\\ 2 ) = 1 
for all 5 > 0, r > 1, H > 0. Thus we can estimate 



m^JIF ■ Kf, n < S}, 2r- H 5\\R\\ 2 ) 

< 1 + ^({^ = r~ n S < h(f, /*) < r- n+1 5}, 2r~ H 5\\R\\ 2 ) 

n=l 

< i + W K"* -1 !!^^ - (i - l/r)}/2) 



n=l 



whenever <5 > 0, r > 1, H > such that ||-R||2 > (1 — l/r)r H . In particular, 
mvT/F: h(f,f*) < 5},2r- H 5\\R\\ 2 ) < 1 + £ N(D, r""^ 1 ||i?|| 2 /4) 

n=l 

whenever <5 > 0, r > 1, H > such that ||i?||2 > 2(1 — l/r)r H , where we have 
used that the bracketing number is a nonincreasing function of the bracket size. 
Now recall that 

>J"(D,e)<f— J for every < e < \\R\\ 2 /4, 
where q, C > 1. Thus 

i jj i / 

mVJJF ■■ Kf, n < ^- h s\\r\\ 2 ) < i + ^ 

n=l 



m ' 8C 



2r- H \\R\ 
whenever $\delta>0$, $r>1$, $H>0$ such that $\|R\|_2\ge 2(1-1/r)\,r^H$. But 

Vr-f"-* < 1 < 1 < H^ll 2 AC 
^ ~ 1 - l/ri - 1 - 1/r _ 2(1 - l/rV ff 2r- K || J R|| 

n=l ' ' \ / / n ii 

as r > 1 and g, C > 1. We can therefore estimate 

mVW--h(f,n<S},2r- H 5\\R\\ 2 )< / ^ ' 



2(1 - l/r)r H V 2r_jf/ P 



9+1 



whenever <5>0, r > 1, H > such that ||i?|| 2 > 2(1 - l/r)r H . 
We now fix $\delta,p>0$ such that $p/\delta<4\wedge 2\|R\|_2$, and choose

$$r=\frac{4}{4-p/\delta},\qquad H=\frac{\log(2\|R\|_2\,\delta/p)}{\log r}.$$

Clearly $r>1$ and $H>0$. Moreover, note that our choice of $r$ and $H$ implies that
$\|R\|_2=2(1-1/r)\,r^H$ and $p=2r^{-H}\delta\|R\|_2$. We have therefore shown that

$$\mathcal{N}_{[]}(\{\sqrt{f/f^*}:h(f,f^*)\le\delta\},p)\le\Big(\frac{8C\delta}{p}\Big)^{q+1}$$

for all $\delta,p>0$ such that $p/\delta<4\wedge 2\|R\|_2$. □ 
5.4. Proof of Theorem 3.3. 

5.4.1. The local geometry of mixtures. Define the Euclidean balls
$B(\theta,\varepsilon)=\{\theta'\in\mathbb{R}^d:\|\theta-\theta'\|<\varepsilon\}$,
denote by $\langle u,v\rangle$ the inner product of two vectors $u,v\in\mathbb{R}^d$,
and denote by $\langle A,u\rangle=\{\langle\theta,u\rangle:\theta\in A\}\subset\mathbb{R}$
the inner product of a set $A\subset\mathbb{R}^d$ with a vector $u\in\mathbb{R}^d$.

LEMMA 5.9. It is possible to choose a bounded convex neighborhood $A_i$ of $\theta_i^*$
for every $i=1,\dots,q^*$ such that, for some linearly independent family
$u_1,\dots,u_d\in\mathbb{R}^d$, the sets $\{\langle A_i,u_j\rangle:i=1,\dots,q^*\}$ are
disjoint for every $j=1,\dots,d$.

PROOF. We first claim that one can choose linearly independent $u_1,\dots,u_d$ such
that $|\{\langle\theta_i^*,u_j\rangle:i=1,\dots,q^*\}|=q^*$ for every $j=1,\dots,d$.
Indeed, note that the set
$\{u\in\mathbb{R}^d:|\{\langle\theta_i^*,u\rangle:i=1,\dots,q^*\}|<q^*\}$ is a finite
union of $(d-1)$-dimensional hyperplanes, which has Lebesgue measure zero. Therefore, if
we draw a rotation matrix $T$ at random from the Haar measure on $SO(d)$, and let
$u_i=Te_i$ for all $i=1,\dots,d$ where $\{e_1,\dots,e_d\}$ is the standard Euclidean
basis in $\mathbb{R}^d$, then the desired property will hold with unit probability. To
complete the proof, it suffices to choose $A_i=B(\theta_i^*,\varepsilon/4)$ with
$\varepsilon=\min_k\min_{i\ne j}|\langle\theta_i^*-\theta_j^*,u_k\rangle|$. □ 
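
The construction in this proof can be tried out numerically. The Python sketch below is an illustration only: the function names are hypothetical, and the random basis is produced by Gram-Schmidt on Gaussian vectors, which gives a random orthonormal basis that need not have determinant one (this does not matter for separating the projections). The returned gap plays the role of the quantity that determines the radius of the balls in the proof.

```python
import math
import random


def random_orthonormal_basis(d, rng):
    """Gram-Schmidt on i.i.d. Gaussian vectors; returns d orthonormal vectors."""
    basis = []
    while len(basis) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in basis:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-12:  # skip the (probability zero) degenerate draw
            basis.append([a / norm for a in v])
    return basis


def min_projection_gap(points, u):
    """Smallest distance between the projections <theta_i, u> of the points."""
    proj = sorted(sum(a * b for a, b in zip(p, u)) for p in points)
    return min(b - a for a, b in zip(proj, proj[1:]))


def separating_directions(points, seed=0):
    """Return a random orthonormal basis and the smallest projection gap eps.

    Balls of radius eps / 4 around the points then have pairwise disjoint
    projections on every basis vector, as in the proof above.
    """
    rng = random.Random(seed)
    basis = random_orthonormal_basis(len(points[0]), rng)
    eps = min(min_projection_gap(points, u) for u in basis)
    return basis, eps


# Three component locations in the plane (d = 2, three components).
example_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
example_basis, example_eps = separating_directions(example_points, seed=0)
```

For these three points the coordinate axes do not work (two of the points share each coordinate), whereas a randomly drawn basis separates all projections, which is the reason for the randomization in the proof.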
FIG 2. Illustration of the construction of the sets $A_i$ for a mixture with $d=2$ and
$q^*=3$. The sets $A_i$ are chosen in such a way that their projections on some linearly
independent vectors $u_1,u_2$ are disjoint. Note that the choice of $u_1,u_2$ is not
arbitrary (e.g., consider the projections on the coordinate axes). 



We now fix once and for all a family of neighborhoods $A_1,\dots,A_{q^*}$ that satisfy
the conditions of Lemma 5.9. The precise choice of these sets only affects the
constants in the proofs below and is therefore irrelevant to our final result; we only
presume that $A_1,\dots,A_{q^*}$ remain fixed throughout the proofs. Let us also define
$A_0=\mathbb{R}^d\setminus(A_1\cup\cdots\cup A_{q^*})$. Then $\{A_0,\dots,A_{q^*}\}$
partitions the parameter set $\mathbb{R}^d$ in such a way that each bounded element
$A_i$, $i=1,\dots,q^*$, contains precisely one component of the mixture $f^*$, while the
unbounded element $A_0$ contains no components of $f^*$. This construction is
illustrated in Figure 2. 

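Let us define for each finite measure $\lambda$ on $\mathbb{R}^d$ the function

$$f_\lambda(x)=\int f_\theta(x)\,\lambda(d\theta).$$

We also define the derivatives $D_1f_\theta(x)\in\mathbb{R}^d$ and
$D_2f_\theta(x)\in\mathbb{R}^{d\times d}$ as

$$[D_1f_\theta(x)]_i=\frac{\partial}{\partial\theta^i}f_\theta(x),\qquad[D_2f_\theta(x)]_{ij}=\frac{\partial^2}{\partial\theta^i\,\partial\theta^j}f_\theta(x).$$

Denote by $\mathcal{P}(A)$ the space of probability measures supported on
$A\subset\mathbb{R}^d$, and denote by $\mathbb{M}^d_+$ the family of all $d\times d$
positive semidefinite (symmetric) matrices. 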
Definition 5.10. Let us write 

D = {{ti,P,p,t,v) :rn,...,ri q * el, /3i,.--,/V G M ^ Pi,---,Pg* G 

r ,...,v >0, foeW,-.^ G«P(A,*)}- 

Then we define for each (77, /3, p, r, 1/) G D the function 

/fa, /3, p, r,u) = r ^ + + $ -J?- + Tr 

and the nonnegative quantity 

9* 9* 

^fa, /5, P, T, I/) = T + ^ \Vi + T i\+^Z 
i=l i=l 

1=1 1=1 ^ 

We now formulate the key result on the local geometry of the mixture class M. 

Theorem 5.11. Suppose that 

1. fo G C 2 and fo(x), D\Jq{x) vanish as \\x\\ — > 00. 

2. ||[Di/o]i//*||i < 00 and \\[D 2 fo]ij / f*\\i < 00 for all i,j = l,...,d. 

Then there exists a constant $c^*>0$ such that 

\\£(ri,P,p,T,v)\\i > c* N(ri,p,p,T,v) for all (r?, 0, p, r, v) G D. 

[The constant c* may depend on /* and A\, . . . , A q * but not on 77, f3, p, r, v.] 

Before we turn to the proof, let us introduce a notion that is familiar in quantum
physics. If $(\Omega,\Sigma)$ is a measurable space, call the map
$\lambda:\Sigma\to\mathbb{R}^{d\times d}$ a state$^2$ if

1. $A\mapsto[\lambda(A)]_{ij}$ is a signed measure for every $i,j=1,\dots,d$;

2. $\lambda(A)$ is a nonnegative symmetric matrix for every $A\in\Sigma$;

3. $\mathrm{Tr}[\lambda(\Omega)]=1$. 



Pi + Ti (P-0*)Vi(d0) 



+ 



$^2$ Our terminology is in analogy with the usual notion of a state on the
$C^*$-algebra $\mathbb{C}^{d\times d}\otimes C_{\mathbb{C}}(\Omega)$, where $\Omega$ is
a compact metric space and $C_{\mathbb{C}}(\Omega)$ is the algebra of complex-valued
continuous functions on $\Omega$. Such states are precisely represented by the
complex-valued counterpart of our definition. 

It is easily seen that for any unit vector $\xi\in\mathbb{R}^d$, the map
$A\mapsto\langle\xi,\lambda(A)\xi\rangle$ is a sub-probability measure. Moreover, if
$\xi_1,\dots,\xi_d\in\mathbb{R}^d$ are linearly independent, there must be at least one
$\xi_i$ such that $\langle\xi_i,\lambda(\Omega)\xi_i\rangle>0$. Finally, let
$B\subset\mathbb{R}^d$ be a compact set and let $(\lambda_n)_{n\ge 0}$ be a sequence of
states on $B$. Then there exists a subsequence along which $\lambda_n$ converges weakly
to some state $\lambda$ on $B$ in the sense that
$\int\mathrm{Tr}[M(\theta)\lambda_n(d\theta)]\to\int\mathrm{Tr}[M(\theta)\lambda(d\theta)]$
for every continuous function $M:B\to\mathbb{R}^{d\times d}$. To see this, it suffices
to note that we may extract a subsequence such that all matrix elements
$[\lambda_n]_{ij}$ converge weakly to a signed measure by the compactness of $B$, and it
is evident that the limit must again define a state. 



Proof of Theorem 5.11. Suppose that the conclusion of the theorem does 
not hold. Then there must exist a sequence of coefficients (rj n , /3 n , p n ,T n ,u n ) G D 
with 

\\£(r) n ,p n ,p n ,T n ,V n )\\l n->oo 



N(r) n ,/3 n ,p n ,r n ,u n ) 

Let us fix such a sequence throughout the proof. 
Applying Taylor's theorem touH> fe*+u(e~e 



we can write for i = 1 , . 



fe 
J 

+ 



Dife 



+ Tr 



P'l 



D 2 fe 



+ r, 



f* 



fu" 
f* 



Dife 
f* 



+ Tr 



D2fe 
f* 



9*fv?(d0) /Tr 



1 D 2 f 



f* 



2(1 -u)du\ X>{dB) 



2 

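where $\lambda_i^n$ is the state on $A_i$ defined by

\[ \int\mathrm{Tr}[M(\theta)\,\lambda_i^n(d\theta)] = \frac{\int\mathrm{Tr}[M(\theta)\,(\theta-\theta_i^*)(\theta-\theta_i^*)^*]\,\nu_i^n(d\theta)}{\int\|\theta-\theta_i^*\|^2\,\nu_i^n(d\theta)} \]

(it is clearly no loss of generality to assume that $\nu_i^n$ has no mass at $\theta_i^*$ for any $i,n$, so that everything is well defined). We now define the coefficients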



N(rj n ,l3 n ,p n ,T n ,v n y 



b? 



C; 



N(r] n ,(3 n ,p n ,T n ,v n )' 
for i = 1, . . . , </*, and 



N(r] n ,(3 n ,p n ,T n ,v n ) 

_ T if\\e-9*\\ 2 vnd0) 

N(r] n ,f3 n ,p n ,T n ,v n ) 
Note that

\[ |a_0^n| + \sum_{i=1}^{q^*}\{|a_i^n| + \|b_i^n\| + \mathrm{Tr}[c_i^n] + |d_i^n|\} = 1 \]

for all $n$. We may therefore extract a subsequence such that:

1. There exist $a_i\in\mathbb{R}$, $b_i\in\mathbb{R}^d$, $c_i\in\mathbb{R}^{d\times d}_+$, and $a_0,d_i\ge0$ (for $i=1,\dots,q^*$) with $|a_0|+\sum_{i=1}^{q^*}\{|a_i|+\|b_i\|+\mathrm{Tr}[c_i]+|d_i|\}=1$, such that $a_0^n\to a_0$ and $a_i^n\to a_i$, $b_i^n\to b_i$, $c_i^n\to c_i$, $d_i^n\to d_i$ as $n\to\infty$ for all $i=1,\dots,q^*$.

2. There exists a sub-probability measure $\nu_0$ supported on $A_0$, such that $\nu_0^n$ converges vaguely to $\nu_0$ as $n\to\infty$.

3. There exist states $\lambda_i$ supported on $\mathrm{cl}\,A_i$ for $i=1,\dots,q^*$, such that $\lambda_i^n$ converges weakly to $\lambda_i$ as $n\to\infty$ for every $i=1,\dots,q^*$.

It follows that the functions $\ell(\eta^n,\beta^n,\rho^n,\tau^n,\nu^n)/N(\eta^n,\beta^n,\rho^n,\tau^n,\nu^n)$ converge pointwise along this subsequence to the function $h/f^*$ defined by

\[ h = a_0 f_{\nu_0} + \sum_{i=1}^{q^*}\Big\{ a_i f_{\theta_i^*} + b_i^* D_1 f_{\theta_i^*} + \mathrm{Tr}[c_i D_2 f_{\theta_i^*}] + d_i\int\mathrm{Tr}\Big[\Big\{\int_0^1 D_2 f_{\theta_i^*+u(\theta-\theta_i^*)}\,2(1-u)\,du\Big\}\lambda_i(d\theta)\Big]\Big\}. \]

But as $\|\ell(\eta^n,\beta^n,\rho^n,\tau^n,\nu^n)\|_1/N(\eta^n,\beta^n,\rho^n,\tau^n,\nu^n)\to0$, we have $\|h/f^*\|_1=0$ by Fatou's lemma. As $f^*$ is strictly positive, we must have $h=0$.
To proceed, we need the following lemma.

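Lemma 5.12. The Fourier transform $F[h](s):=\int e^{i\langle x,s\rangle}h(x)\,dx$ is given by

\[ F[h](s) = F[f_0](s)\Big[ a_0\int e^{i\langle\theta,s\rangle}\nu_0(d\theta) + \sum_{i=1}^{q^*} e^{i\langle\theta_i^*,s\rangle}\Big\{ a_i + i\langle b_i,s\rangle - \langle s,c_is\rangle - d_i\int\phi(i\langle\theta-\theta_i^*,s\rangle)\,\langle s,\lambda_i(d\theta)s\rangle\Big\}\Big] \]

for all $s\in\mathbb{R}^d$. Here we defined the function $\phi(u)=2(e^u-u-1)/u^2$.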

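PROOF. The $a_i,b_i,c_i$ terms are easily computed using integration by parts. It remains to compute the Fourier transform of the function

\[ [\Xi_i(x)]_{jk} = \int\Big\{\int_0^1[D_2f_{\theta_i^*+u(\theta-\theta_i^*)}(x)]_{jk}\,2(1-u)\,du\Big\}[\lambda_i(d\theta)]_{kj}. \]

We begin by noting that

\[ \int\!\!\int\!\!\int_0^1 \big|[D_2f_{\theta_i^*+u(\theta-\theta_i^*)}(x)]_{jk}\big|\,2(1-u)\,du\,dx\,|[\lambda_i]_{kj}|(d\theta) = \|[\lambda_i]_{kj}\|_{\mathrm{TV}}\int\big|[D_2f_0(x)]_{jk}\big|\,dx < \infty. \]

We may therefore apply Fubini's theorem, giving

\[ F[[\Xi_i]_{jk}](s) = -F[f_0](s)\,s_js_k\,e^{i\langle\theta_i^*,s\rangle}\int\Big\{\int_0^1 e^{iu\langle\theta-\theta_i^*,s\rangle}\,2(1-u)\,du\Big\}[\lambda_i(d\theta)]_{kj} = -F[f_0](s)\,s_js_k\,e^{i\langle\theta_i^*,s\rangle}\int\phi(i\langle\theta-\theta_i^*,s\rangle)\,[\lambda_i(d\theta)]_{kj}, \]

where we have computed the inner integral using integration by parts. □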
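The inner integral in the proof above is $\int_0^1 e^{zu}\,2(1-u)\,du=\phi(z)$ with $\phi(z)=2(e^z-z-1)/z^2$, valid for complex $z$ as well. The following is a small numerical sanity check of that identity, not part of the original argument; the function names and the quadrature rule (composite Simpson) are choices made here for illustration.

```python
import cmath

def phi(z):
    """phi(z) = 2 (e^z - z - 1) / z^2, extended by its power series near 0."""
    if abs(z) < 1e-3:
        # 2 * sum_{k>=0} z^k / (k+2)!  truncated after the cubic term
        return 1.0 + z / 3.0 + z * z / 12.0 + z ** 3 / 60.0
    return 2.0 * (cmath.exp(z) - z - 1.0) / (z * z)

def inner_integral(z, n=2000):
    """Composite Simpson approximation of int_0^1 e^{z u} 2 (1 - u) du (n even)."""
    h = 1.0 / n
    total = 0.0
    for k in range(n + 1):
        u = k * h
        weight = 1 if k in (0, n) else (4 if k % 2 else 2)
        total += weight * cmath.exp(z * u) * 2.0 * (1.0 - u)
    return total * h / 3.0

# Purely imaginary arguments z = i t are the ones that occur in Lemma 5.12.
gap_imag = abs(phi(2.0j) - inner_integral(2.0j))
gap_real = abs(phi(-1.5) - inner_integral(-1.5))
```

Both gaps are at the level of quadrature and rounding error, which is consistent with the integration by parts used in the proof.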

Let $u_1,\dots,u_d\in\mathbb{R}^d$ be a linearly independent family satisfying the condition of Lemma 5.9. As $F[h](s)=0$ for all $s\in\mathbb{R}^d$, we obtain

\[ \Phi^\ell(it) := a_0\Phi_0^\ell(it) + \sum_{i=1}^{q^*} e^{it\langle\theta_i^*,u_\ell\rangle}\{a_i + it\langle b_i,u_\ell\rangle - t^2\langle u_\ell,c_iu_\ell\rangle - d_it^2\Phi_i^\ell(it)\} = 0 \]

for all $\ell=1,\dots,d$ and $t\in[-\iota,\iota]$ for some $\iota>0$, where we defined

\[ \Phi_i^\ell(it) = \int\phi(it\langle\theta-\theta_i^*,u_\ell\rangle)\,\langle u_\ell,\lambda_i(d\theta)u_\ell\rangle \]

for $i=1,\dots,q^*$, and

\[ \Phi_0^\ell(it) = \int e^{it\langle\theta,u_\ell\rangle}\,\nu_0(d\theta). \]

Indeed, it suffices to note that $F[f_0](0)=1$ and that $s\mapsto F[f_0](s)$ is continuous, so that this claim follows from Lemma 5.12 and the fact that $F[f_0](s)$ is nonvanishing in a sufficiently small neighborhood of the origin.

As all $\lambda_i$ have compact support, it is easily seen that for every $i=1,\dots,q^*$, the function $\Phi_i^\ell(z)$ is defined for all $z\in\mathbb{C}$ by a convergent power series. The function $\Psi^\ell(it):=\Phi^\ell(it)-a_0\Phi_0^\ell(it)$ is therefore an entire function with $|\Psi^\ell(z)|\le k_1e^{k_2|z|}$ for some $k_1,k_2>0$ and all $z\in\mathbb{C}$. But as $\Psi^\ell(it)=-a_0\Phi_0^\ell(it)$ for $t\in[-\iota,\iota]$, it follows from [20], Theorem 7.2.2 that $a_0\Phi_0^\ell(it)$ is the Fourier transform of a finite measure with compact support. Thus we may assume without loss of generality that the law of $\langle\theta,u_\ell\rangle$ under the sub-probability $\nu_0$ is compactly supported for every $\ell=1,\dots,d$, so by linear independence $\nu_0$ must be compactly supported. Therefore, the function $\Phi^\ell(z)$ is defined for all $z\in\mathbb{C}$ by a convergent power series. But as $\Phi^\ell(z)$ vanishes for $z\in i[-\iota,\iota]$, we must have $\Phi^\ell(z)=0$ for all $z\in\mathbb{C}$, and in particular

\[ \Phi^\ell(t) = a_0\Phi_0^\ell(t) + \sum_{i=1}^{q^*} e^{t\langle\theta_i^*,u_\ell\rangle}\{a_i + t\langle b_i,u_\ell\rangle + t^2\langle u_\ell,c_iu_\ell\rangle + d_it^2\Phi_i^\ell(t)\} = 0 \tag{5.4} \]

for all $t\in\mathbb{R}$ and $\ell=1,\dots,d$. In the remainder of the proof, we argue that (5.4) cannot hold, thus completing the proof by contradiction.

At the heart of our proof is an inductive argument. Recall that by construction, the projections $\{\langle A_i,u_\ell\rangle:i=1,\dots,q^*\}$ are disjoint open intervals in $\mathbb{R}$ for every $\ell=1,\dots,d$. We can therefore relabel them in increasing order: that is, define $(\ell1),\dots,(\ell q^*)\in\{1,\dots,q^*\}$ so that $\langle\theta^*_{(\ell1)},u_\ell\rangle<\langle\theta^*_{(\ell2)},u_\ell\rangle<\cdots<\langle\theta^*_{(\ell q^*)},u_\ell\rangle$. The following key result provides the inductive step in our proof.

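Proposition 5.13. Fix $\ell\in\{1,\dots,d\}$, and define

\[ \Phi^{\ell,0}(t) := a_0\Phi_0^\ell(t) + \sum_{i=1}^{q^*}a_i\,e^{t\langle\theta_i^*,u_\ell\rangle}. \]

Suppose that for some $j\in\{1,\dots,q^*\}$ we have $\Phi^{\ell,j}(t)=0$ for all $t\in\mathbb{R}$, where

\[ \Phi^{\ell,j}(t) := \Phi^{\ell,0}(t) + \sum_{i=1}^{j} e^{t\langle\theta^*_{(\ell i)},u_\ell\rangle}\{t\langle b_{(\ell i)},u_\ell\rangle + t^2\langle u_\ell,c_{(\ell i)}u_\ell\rangle + d_{(\ell i)}t^2\Phi^\ell_{(\ell i)}(t)\}. \]

Then $d_{(\ell j)}\langle u_\ell,\lambda_{(\ell j)}(\mathbb{R}^d)u_\ell\rangle=0$, $\langle u_\ell,c_{(\ell j)}u_\ell\rangle=0$, and $\langle b_{(\ell j)},u_\ell\rangle=0$.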

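PROOF. Let us write for simplicity $\theta_i^\ell=\langle\theta_i^*,u_\ell\rangle$, and denote by $\lambda_i^\ell$ and $\nu_0^\ell$ the finite measures on $\mathbb{R}$ defined such that $\int f(x)\lambda_i^\ell(dx)=\int f(\langle\theta,u_\ell\rangle)\langle u_\ell,\lambda_i(d\theta)u_\ell\rangle$ and $\int f(x)\nu_0^\ell(dx)=\int f(\langle\theta,u_\ell\rangle)\nu_0(d\theta)$, respectively. For notational convenience, we will assume in the following that $(\ell i)=i$ and $\nu_0^\ell(\{\theta_i^\ell\})=0$ for all $i=1,\dots,q^*$. This entails no loss of generality: the former can always be attained by relabeling of the points $\theta_i^*$, while $\Phi^{\ell,0}$ is unchanged if we replace $\nu_0^\ell$ and $a_i$ by $\nu_0^\ell(\,\cdot\cap\mathbb{R}\setminus\{\theta_1^\ell,\dots,\theta_{q^*}^\ell\})$ and $a_i+a_0\nu_0^\ell(\{\theta_i^\ell\})$, respectively. Note that

\[ \langle A_i,u_\ell\rangle = \,]\theta_i^{\ell-},\theta_i^{\ell+}[\,,\quad\text{where}\quad \theta_i^{\ell-}<\theta_i^\ell<\theta_i^{\ell+}\le\theta_{i+1}^{\ell-}\ \text{for all } i \]

by our assumptions ($\langle A_i,u_\ell\rangle$ must be an interval as $A_i$ is convex).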
Step 1. We claim that the following hold:

\[ a_i=0 \ \text{for all } i\ge j+1 \quad\text{and}\quad a_0\nu_0^\ell([\theta_{j+1}^\ell,\infty[)=0. \]

Indeed, suppose this is not the case. Then it is easily seen that



liminf^>0, 



where we have used that $\nu_0^\ell$ has no mass at $\{\theta_1^\ell,\dots,\theta_{q^*}^\ell\}$. On the other hand, as $\phi$ is positive and increasing and as $\lambda_i$ is supported on $\mathrm{cl}\,A_i$, we can estimate

o < t^M < e e-w + i-*) m q+ - 9^ Af ( 

for i = 1, . . . , j. But then we must have 



as 



±1 oo >Q 



= liminf ' . Wl >0, 



which yields the desired contradiction. 
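Step 2. We claim that the following hold:

\[ d_j\lambda_j^\ell([\theta_j^\ell,\infty[)=0,\quad \langle u_\ell,c_ju_\ell\rangle=0,\quad\text{and}\quad a_0\nu_0^\ell([\theta_j^\ell,\infty[)=0. \]

Indeed, suppose this is not the case. As $\nu_0^\ell(\{\theta_j^\ell\})=0$, we can choose $\varepsilon>0$ such that $\nu_0^\ell([\theta_j^\ell+\varepsilon,\infty[)\ge\nu_0^\ell([\theta_j^\ell,\infty[)/2$. As $a_0,d_j\ge0$, and using that $\phi$ is positive and increasing with $\phi(0)=1$ and that $e^{\varepsilon t}\ge(\varepsilon t)^2/2$ for $t\ge0$, we can estimate

\[ a_0\Phi_0^\ell(t) + e^{t\theta_j^\ell}\{t^2\langle u_\ell,c_ju_\ell\rangle + d_jt^2\Phi_j^\ell(t)\} \ge t^2e^{t\theta_j^\ell}\Big\{\frac{\varepsilon^2}{4}\,a_0\nu_0^\ell([\theta_j^\ell,\infty[) + \langle u_\ell,c_ju_\ell\rangle + d_j\lambda_j^\ell([\theta_j^\ell,\infty[)\Big\} > 0 \]

for all $t>0$. On the other hand, it is easily seen that

\[ \frac{\sum_{i=1}^{j}e^{t\theta_i^\ell}\{a_i+t\langle b_i,u_\ell\rangle\} + \sum_{i=1}^{j-1}e^{t\theta_i^\ell}\{t^2\langle u_\ell,c_iu_\ell\rangle + d_it^2\Phi_i^\ell(t)\}}{t^2e^{t\theta_j^\ell}} \xrightarrow{\,t\to\infty\,} 0. \]

But as $\Phi^{\ell,j}(t)=0$, this would imply that

\[ 0 = \lim_{t\to\infty}\frac{-\sum_{i=1}^{j}e^{t\theta_i^\ell}\{a_i+t\langle b_i,u_\ell\rangle\} - \sum_{i=1}^{j-1}e^{t\theta_i^\ell}\{t^2\langle u_\ell,c_iu_\ell\rangle + d_it^2\Phi_i^\ell(t)\}}{a_0\Phi_0^\ell(t) + e^{t\theta_j^\ell}\{t^2\langle u_\ell,c_ju_\ell\rangle + d_jt^2\Phi_j^\ell(t)\}} = 1, \]

which yields the desired contradiction.
Step 3. We claim that the following hold:

\[ d_j\lambda_j^\ell([\theta_j^{\ell-},\theta_j^\ell[)=0 \quad\text{and}\quad a_0\nu_0^\ell([\theta_j^{\ell-},\theta_j^\ell[)=0. \]

Indeed, suppose this is not the case. We can compute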



dt 2 V S 



j=djj e t(e - e ^ \ e 3 (d6) + a J e t{ ^> (9 - 9]f i/ {d6) 



+ ^2^2 e t(6 > ett) {ai + t{bi,u e ) +t 2 {u e ,au e ) + d i t 2 $ e l (t)}, 



i=l 



where the derivative and integral may be exchanged by [28], Appendix A16. We now note that as $a_0,d_j\ge0$, we can estimate for $t\ge0$



dj / \%d9) + a / (9 - 9*f 4{d9) 



> 



9^) 2 4{d9) \ > 0. 



On the other hand, as $(e^x-1)/x$ is positive and increasing, we obtain for $t\ge0$



- 2{9\ - 0* 



3 * 



which converges to zero as $t\to\infty$ for every $i<j$. It follows that



= lim 



' /J ^(t)/.-*# 



/ e t(fl -^ \Ud9) + a J e'V-W (9 - 9'f v l {d9) 



1, 



which yields the desired contradiction. 

Step 4. Recall that $\lambda_j^\ell$ is supported on $[\theta_j^{\ell-},\theta_j^{\ell+}]$ by construction. We have therefore established in the previous steps that the following hold:

\[ d_j\langle u_\ell,\lambda_j(\mathbb{R}^d)u_\ell\rangle = \langle u_\ell,c_ju_\ell\rangle = a_0\nu_0^\ell([\theta_j^{\ell-},\infty[) = 0,\qquad a_i=0\ \text{for } i>j. \]

It is therefore easily seen that

\[ 0 = \lim_{t\to\infty}\frac{\Phi^{\ell,j}(t)}{t\,e^{t\theta_j^\ell}} = \langle b_j,u_\ell\rangle. \]

Thus the proof is complete. □



We can now perform the induction by starting from (5.4) and applying Proposition 5.13 repeatedly. This yields $d_j\langle u_\ell,\lambda_j(\mathbb{R}^d)u_\ell\rangle=\langle u_\ell,c_ju_\ell\rangle=\langle b_j,u_\ell\rangle=0$ for all $j=1,\dots,q^*$ and $\ell=1,\dots,d$. As $u_1,\dots,u_d$ are linearly independent and $c_j\in\mathbb{R}^{d\times d}_+$, this implies that $b_j=0$, $c_j=0$ and $d_j=0$ for all $j=1,\dots,q^*$, so that

\[ a_0\int e^{i\langle\theta,s\rangle}\,\nu_0(d\theta) + \sum_{i=1}^{q^*}a_i\,e^{i\langle\theta_i^*,s\rangle} = 0 \]

for all $s\in\mathbb{R}^d$ (this follows as above by Lemma 5.12, $h=0$, $F[f_0](s)\ne0$ for $s$ in a neighborhood of the origin, and using analyticity). But by the uniqueness of Fourier transforms, this implies that the signed measure $a_0\nu_0+\sum_{i=1}^{q^*}a_i\delta_{\{\theta_i^*\}}$ has no mass. As $\nu_0$ is supported on $A_0$, this implies that $a_j=0$ for all $j=1,\dots,q^*$. We have therefore shown that $a_i,b_i,c_i,d_i=0$ for all $i=1,\dots,q^*$. But recall that $|a_0|+\sum_{i=1}^{q^*}\{|a_i|+\|b_i\|+\mathrm{Tr}[c_i]+|d_i|\}=1$, so that evidently $a_0=1$.
To complete the proof, it remains to note that 

But this is impossible, as

\[ \Big\|\frac{\ell(\eta^n,\beta^n,\rho^n,\tau^n,\nu^n)}{N(\eta^n,\beta^n,\rho^n,\tau^n,\nu^n)}\Big\|_1 \xrightarrow{\,n\to\infty\,} 0 \]

by construction. Thus we have the desired contradiction. □



5.4.2. Proof of Theorem 3.3. The proof of Theorem 3.3 consists of a sequence of approximations, which we develop in the form of lemmas. Throughout this section, we always presume that Assumption A holds.

We begin by establishing the existence of an envelope function.

Lemma 5.14. Define $S=(H_0+H_1+H_2)\,d/c^*$. Then $S\in L^4(f^*d\mu)$, and

\[ \frac{|f/f^*-1|}{\|f/f^*-1\|_1} \le S \quad\text{for all } f\in\mathcal{M}. \]
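PROOF. That $S\in L^4(f^*d\mu)$ follows directly from Assumption A. To proceed, let $f\in\mathcal{M}_q$, so that we can write $f=\sum_{j=1}^q\pi_jf_{\theta_j}$. Then

\[ \frac{f-f^*}{f^*} = \sum_{j:\theta_j\in A_0}\pi_j\frac{f_{\theta_j}}{f^*} + \sum_{i=1}^{q^*}\Big\{\Big(\sum_{j:\theta_j\in A_i}\pi_j-\pi_i^*\Big)\frac{f_{\theta_i^*}}{f^*} + \sum_{j:\theta_j\in A_i}\pi_j\,\frac{f_{\theta_j}-f_{\theta_i^*}}{f^*}\Big\}. \]

Taylor expansion gives

\[ f_{\theta_j}(x)-f_{\theta_i^*}(x) = (\theta_j-\theta_i^*)^*D_1f_{\theta_i^*}(x) + \frac12\int_0^1(\theta_j-\theta_i^*)^*D_2f_{\theta_i^*+u(\theta_j-\theta_i^*)}(x)\,(\theta_j-\theta_i^*)\,2(1-u)\,du. \]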



Using Assumption A, we find that 

f-r 



f* 



< 



E 



+ 



E 



7T 



j-MjGAi 



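On the other hand, Theorem 5.11 gives

\[ \Big\|\frac{f-f^*}{f^*}\Big\|_1 \ge c^*\Big\{\sum_{j:\theta_j\in A_0}\pi_j + \sum_{i=1}^{q^*}\Big\{\Big|\sum_{j:\theta_j\in A_i}\pi_j-\pi_i^*\Big| + \Big\|\sum_{j:\theta_j\in A_i}\pi_j(\theta_j-\theta_i^*)\Big\| + \frac12\sum_{j:\theta_j\in A_i}\pi_j\|\theta_j-\theta_i^*\|^2\Big\}\Big\}. \]

The proof follows directly. □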
COROLLARY 5.15. $|d|\le D$ for all $d\in\mathcal{D}$, where $D=2S\in L^4(f^*d\mu)$.

Proof. Using $\|f-f^*\|_{\mathrm{TV}}\le2h(f,f^*)$ and $|\sqrt{x}-1|\le|x-1|$, we find

\[ |d_f| = \frac{|\sqrt{f/f^*}-1|}{h(f,f^*)} \le \frac{|f/f^*-1|}{\tfrac12\|f/f^*-1\|_1} \le 2S, \]

where we have used Lemma 5.14. □
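The inequality $\|f-f^*\|_{\mathrm{TV}}\le2h(f,f^*)$ used in this proof, together with the bound $h(f,f^*)\le\sqrt2$ invoked later, can be checked numerically on a pair of unit-variance Gaussian densities. The sketch assumes the normalizations $\|f-g\|_{\mathrm{TV}}=\int|f-g|\,d\mu$ and $h(f,g)^2=\int(\sqrt f-\sqrt g)^2\,d\mu$, which are the ones consistent with both bounds; the function names and the grid parameters are illustrative only.

```python
import math

def normal_pdf(x, mean):
    """Density of N(mean, 1)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def tv_and_hellinger(mean_f, mean_g, lo=-12.0, hi=12.0, n=24000):
    """Midpoint-rule approximations of int |f-g| and (int (sqrt f - sqrt g)^2)^(1/2)."""
    step = (hi - lo) / n
    tv = 0.0
    hel_sq = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * step
        f = normal_pdf(x, mean_f)
        g = normal_pdf(x, mean_g)
        tv += abs(f - g) * step
        hel_sq += (math.sqrt(f) - math.sqrt(g)) ** 2 * step
    return tv, math.sqrt(hel_sq)

tv, hel = tv_and_hellinger(0.0, 1.0)
# For N(0,1) against N(m,1) the squared distance is 2 (1 - exp(-m^2 / 8)) in closed form.
closed_form_hel_sq = 2.0 * (1.0 - math.exp(-1.0 / 8.0))
```

For the pair above the total variation is roughly 0.77 and twice the Hellinger distance is roughly 0.97, so the inequality holds with some room.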



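Next, we prove that the Hellinger normalized densities $d_f$ can be approximated by chi-square normalized densities for small $h(f,f^*)$.

Lemma 5.16. For every $f\in\mathcal{M}$ with $f\ne f^*$, we have

\[ \Big|\frac{\sqrt{f/f^*}-1}{h(f,f^*)} - \frac{f/f^*-1}{\sqrt{\chi^2(f\|f^*)}}\Big| \le \{4\|S\|_4^2\,S+2S^2\}\,h(f,f^*), \]

where we have defined the chi-square divergence $\chi^2(f\|f^*)=\|f/f^*-1\|_2^2$.

PROOF. Let us define the function $R$ as

\[ R = 2\big(\sqrt{f/f^*}-1\big) - (f/f^*-1). \]

Then we have

\[ \frac{\sqrt{f/f^*}-1}{h(f,f^*)} - \frac{f/f^*-1}{\sqrt{\chi^2(f\|f^*)}} = \frac{f/f^*-1+R}{\|f/f^*-1+R\|_2} - \frac{f/f^*-1}{\|f/f^*-1\|_2} = \frac{(f/f^*-1+R)\{\|f/f^*-1\|_2-\|f/f^*-1+R\|_2\} + R\,\|f/f^*-1+R\|_2}{\|f/f^*-1+R\|_2\,\|f/f^*-1\|_2}, \]

so that by the reverse triangle inequality and Corollary 5.15

\[ \Big|\frac{\sqrt{f/f^*}-1}{h(f,f^*)} - \frac{f/f^*-1}{\sqrt{\chi^2(f\|f^*)}}\Big| \le \frac{2\|R\|_2\,S+|R|}{\|f/f^*-1\|_2}. \]

Now note that for all $x\ge-1$

\[ -\frac{x^2}{2} \le -\frac{(\sqrt{1+x}-1)^2}{2} = \sqrt{1+x}-1-\frac{x}{2} \le 0. \]

Therefore, by Lemma 5.14,

\[ |R| \le \Big|\frac{f-f^*}{f^*}\Big|^2 \le S^2\,\Big\|\frac{f-f^*}{f^*}\Big\|_1^2 \le S^2\,\Big\|\frac{f-f^*}{f^*}\Big\|_1\Big\|\frac{f-f^*}{f^*}\Big\|_2. \]

The proof is easily completed using $\|f-f^*\|_{\mathrm{TV}}\le2h(f,f^*)$. □

Finally, we need one further approximation step.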
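The elementary inequality for $\sqrt{1+x}$ that drives the bound on $R$ rests on the exact identity $\sqrt{1+x}-1-x/2=-(\sqrt{1+x}-1)^2/2$. The following sketch checks the identity and both bounds on a grid of values $x\ge-1$; the function name and the grid are illustrative choices made here.

```python
import math

def second_order_gap(x):
    """sqrt(1 + x) - 1 - x / 2, defined for x >= -1."""
    return math.sqrt(1.0 + x) - 1.0 - 0.5 * x

# Grid of test points covering x = -1 up to x = 50.
grid = [-1.0 + 0.01 * k for k in range(5101)]

# Largest violation of each claim over the grid (should be ~ rounding error).
worst_upper = max(second_order_gap(x) for x in grid)
worst_lower = max(-0.5 * x * x - second_order_gap(x) for x in grid)
worst_identity = max(
    abs(second_order_gap(x) + 0.5 * (math.sqrt(1.0 + x) - 1.0) ** 2) for x in grid
)
```

With $x=f/f^*-1$ this gives $|R|=(\sqrt{f/f^*}-1)^2\le(f/f^*-1)^2$ pointwise, which is the first inequality in the chain above.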



Lemma 5.17. Let $q\in\mathbb{N}$ and $\alpha>0$. Then for every $f\in\mathcal{M}_q$ such that $h(f,f^*)\le\alpha$, it is possible to choose coefficients $\eta_i\in\mathbb{R}$, $\beta_i\in\mathbb{R}^d$, $\rho_i\in\mathbb{R}^{d\times d}_+$ for $i=1,\dots,q^*$, and $\gamma_i\ge0$, $\theta_i\in\Theta$ for $i=1,\dots,q$, such that $\sum_{i=1}^{q^*}\mathrm{rank}[\rho_i]\le q\wedge dq^*$,



i=l 



< 



1 1 



E 

i=l 



1 2T 

c* Vc*a 



i=i 



3=1 



c*a A c* 



and 



f/f* ~ 1 



where we have defined 



i=i 



E^- 

i=i ^ 



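PROOF. As $f\in\mathcal{M}_q$, we can write $f=\sum_{j=1}^q\pi_jf_{\theta_j}$. Note that by Theorem 5.11

\[ h(f,f^*) \ge \frac{c^*}{4}\sum_{i=1}^{q^*}\sum_{j:\theta_j\in A_i}\pi_j\|\theta_j-\theta_i^*\|^2. \]

Therefore, $h(f,f^*)\le\alpha$ implies $\pi_j\|\theta_j-\theta_i^*\|^2\le4\alpha/c^*$ for $\theta_j\in A_i$. In particular, whenever $\theta_j\in A_i$, either $\pi_j\le2\sqrt{\alpha/c^*}$ or $\|\theta_j-\theta_i^*\|^2\le2\sqrt{\alpha/c^*}$. Define

\[ J = \bigcup_{i=1,\dots,q^*}\big\{j:\theta_j\in A_i,\ \|\theta_j-\theta_i^*\|^2\le2\sqrt{\alpha/c^*}\big\}. \]

Taylor expansion gives

\[ f_{\theta_j}(x)-f_{\theta_i^*}(x) = (\theta_j-\theta_i^*)^*D_1f_{\theta_i^*}(x) + \frac12(\theta_j-\theta_i^*)^*D_2f_{\theta_i^*}(x)\,(\theta_j-\theta_i^*) + R_{ji}(x), \]

where $|R_{ji}|\le\frac16d^{3/2}\|\theta_j-\theta_i^*\|^3H_3$. We can therefore write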
/-/ 



/ 



— = l + e E n j R w 

i=l j&J-.dj&Ai 
where we have defined



L = 



= £{( E 



m - 7T,; 



7T 



j£j:0jeAi 



Dife 
f* 



+ \ E "./ ,|, i~ r/ / 
jeJ-.6jeAi 



?)|+E^- 



/ 



Now note that 

///* - 1 



^ \f/r-M \\f/r-i-L\\ 2 \f/r-i 

-||///*-l|| 2 ||L|| 2 ' ||L|| 2 



\\f/r-i-L\\ 2 s + \f/r-\-L\ 

where we have used Lemma 5.14. By Theorem 5.11, we obtain

i=l j&J.ej&Ai 

Therefore, we can estimate 

where we have used the definition of $J$. Setting $\ell=L/\|L\|_2$, we obtain



///* - 1 



It remains to show that for our choice of $\ell=L/\|L\|_2$, the coefficients $\eta,\beta,\rho,\gamma$ in the statement of the lemma satisfy the desired bounds. These coefficients are



m = 



E 

KjeJ-.OjeAi 



*3 - < > 



' jeJ-.djeAi 



j£j:0j£Ai 



Uj 9 
Clearly $\mathrm{rank}[\rho_i]\le\#\{j:\theta_j\in A_i\}\wedge d$, so $\sum_{i=1}^{q^*}\mathrm{rank}[\rho_i]\le q\wedge dq^*$. Moreover,



iLlla > c* 



E ^+E 

j:0j€A o i=l 



E i 



+ 



E ^i-flf) 



+ ^ E 



3*l|2 



by Theorem 5.11. It follows that $\mathrm{Tr}[\rho_i]\le1/c^*$. Now note that for $j\notin J$ such that $\theta_j\in A_i$, we have $\|\theta_j-\theta_i^*\|^2>2\sqrt{\alpha/c^*}$ by construction. Therefore



Ii|| 2 > c* 



E ^j+oE E ^ii^ 



3*l|2 



j&J:6j£A 
•Q 



i=l j^J-.Oj&Ai 

It follows that $\sum_{j=1}^{q}|\gamma_j|\le1/(\sqrt{c^*\alpha}\wedge c^*)$. Next, we note that



> (V?aAc*) ^ iij. 



E 

i=l 



E ^ - ^ 

jeJ-.Bj&Ai 



< 



E 

i=i 



E 7r i- ,r i 
j':%eA 4 



+ E 

j&J-Mj&Ao 



Therefore $\sum_{i=1}^{q^*}|\eta_i|\le1/c^*+1/\sqrt{c^*\alpha}$. Finally, note that



E 

2=1 



E ^--c 

j£j:6j£Ai 



£E 

i=i 



E 

i^jeAi 



+ 2T £ vr,, 
jgj-.erfAo 



Therefore $\sum_{i=1}^{q^*}\|\beta_i\|\le1/c^*+2T/\sqrt{c^*\alpha}$. The proof is complete. □

We can now complete the proof of Theorem 3.3.

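Proof of Theorem 3.3. Let $\alpha>0$ be a constant to be chosen later on, and

\[ \mathcal{D}_{q,\alpha} = \{d_f: f\in\mathcal{M}_q,\ f\ne f^*,\ h(f,f^*)\le\alpha\}. \]

Then clearly

\[ N(\mathcal{D}_q,\delta) \le N(\mathcal{D}_{q,\alpha},\delta) + N(\mathcal{D}_q\setminus\mathcal{D}_{q,\alpha},\delta). \]

We will estimate each term separately.
Step 1 (the first term). Define

\[ \mathfrak{M}_q = \{(m_1,\dots,m_{q^*})\in\mathbb{Z}_+^{q^*}: m_1+\cdots+m_{q^*}=q\wedge dq^*\}. \]

For every $m\in\mathfrak{M}_q$, we define the family of functions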
( i* 



q,m,a 



£ 

i=i 



^ + ^-^r- + E Pii^rPv ) + E ^77 : 



q,m,a f 1 



where 



(?7, /3, p, 7, 0) G M 9 * x (l^) 9 * x (M d ) mi x • • • x (M d )"V xR ? x6 ? : 
9 * 11 q * 

En^+^> Ei 



'era 



i=i 



1 2T 

c* v/c*a 



i=i 1=1 7=1 v ; 



Define the family of functions 

^q,a — ^q,m,a 
mgM, 

From Lemmas 5.16 and 5.17, we find that for any function $d\in\mathcal{D}_{q,\alpha}$, there exists a function $\ell\in\mathfrak{L}_{q,\alpha}$ such that (here we use that $h(f,f^*)\le\sqrt2$ for any $f$)


\d-£\< {4||S|||S + 2S 2 } («aV2) + 



3(c*) 5 /4 

Using $\alpha\wedge\sqrt2\le2^{3/8}\alpha^{1/4}$ for all $\alpha>0$, we can estimate

\d-£\< «V4 ^ y = ( 1 +J™ 2 + 8\\S\\l + 4 ) cf 3 / 2 {S + S 2 + H 3 }, 

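where $U\in L^2(f^*d\mu)$ by Assumption A. Now note that if $m_1\le\ell\le m_2$ for some functions $m_1,m_2$ with $\|m_2-m_1\|_2\le\varepsilon$, then $m_1-\alpha^{1/4}U\le d\le m_2+\alpha^{1/4}U$ with $\|(m_2+\alpha^{1/4}U)-(m_1-\alpha^{1/4}U)\|_2\le\varepsilon+2\alpha^{1/4}\|U\|_2$. Therefore

\[ N(\mathcal{D}_{q,\alpha},\varepsilon+2\alpha^{1/4}\|U\|_2) \le N(\mathfrak{L}_{q,\alpha},\varepsilon) \le \sum_{m\in\mathfrak{M}_q}N(\mathfrak{L}_{q,m,\alpha},\varepsilon) \quad\text{for } \varepsilon>0. \]

Of course, we will ultimately choose $\varepsilon,\alpha$ such that $\varepsilon+2\alpha^{1/4}\|U\|_2=\delta$.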
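The elementary bound $\alpha\wedge\sqrt2\le2^{3/8}\alpha^{1/4}$ used in this step follows by treating the cases $\alpha\le\sqrt2$ and $\alpha>\sqrt2$ separately, with equality exactly at $\alpha=\sqrt2$. A quick numerical confirmation on a logarithmic grid (the function names are illustrative only):

```python
import math

def truncated(alpha):
    """alpha ∧ sqrt(2)."""
    return min(alpha, math.sqrt(2.0))

def quarter_power_bound(alpha):
    """2^(3/8) * alpha^(1/4)."""
    return 2.0 ** 0.375 * alpha ** 0.25

# alpha ranging over sixteen decades, 1e-8 ... 1e8.
alphas = [10.0 ** (k / 10.0) for k in range(-80, 81)]
largest_excess = max(truncated(a) - quarter_power_bound(a) for a in alphas)

# Equality case: both sides equal sqrt(2) at alpha = sqrt(2).
equality_gap = abs(truncated(math.sqrt(2.0)) - quarter_power_bound(math.sqrt(2.0)))
```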
We proceed to estimate the bracketing number $N(\mathfrak{L}_{q,m,\alpha},\varepsilon)$. To this end, let $\ell,\ell'\in\mathfrak{L}_{q,m,\alpha}$, where $\ell$ is defined by the parameters $(\eta,\beta,\rho,\gamma,\theta)\in\mathfrak{I}_{q,m,\alpha}$ and $\ell'$ is defined by the parameters $(\eta',\beta',\rho',\gamma',\theta')\in\mathfrak{I}_{q,m,\alpha}$. Note that



9 m; 

EE 

i=i i=i 



L Pij - <Aj)* 



D2fe 
f* 



2d 

VC i=l.J=l 



We can therefore estimate 



I* - ^| <#o£ ^ - ^ + #iVdE || a - A'n +i?oE to - 7*1+ 

J'=l 



1=1 



1=1 



ill max ||6>,- - 6>'|| + 

j=l,-,q J 3 



2dy/d<f 



Ho 



q mi 



EE wpij-pi 

i=l j=l 



''J 1 



1/2 



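where we have used that $|f_\theta-f_{\theta'}|/f^*\le\|\theta-\theta'\|\,H_1\sqrt{d}$ by Taylor expansion. Therefore, writing $V=(H_0+H_1+H_2)\,d\sqrt{dq^*}$, we have

\[ |\ell-\ell'| \le V\,|\!|\!|(\eta,\beta,\rho,\gamma,\theta)-(\eta',\beta',\rho',\gamma',\theta')|\!|\!|_{q,m,\alpha}, \]

where $|\!|\!|\cdot|\!|\!|_{q,m,\alpha}$ is the norm on $\mathbb{R}^{(1+d)q^*+d(q\wedge dq^*)+(1+d)q}$ defined by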
1 1 1 1 1 1 (J ^ f 1 1, ^ (_x ~ 

111 (»/, p, 7, fl)in 9>m>a = E 1*1 + E lift 11 + E n 

3=1 



1=1 



1=1 



+ 



f 



'c*a A c* j =1 v,<? 



max + 



q nii 



EEim s 

i=i j=i 



1/2 



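Note that if $|\!|\!|(\eta,\beta,\rho,\gamma,\theta)-(\eta',\beta',\rho',\gamma',\theta')|\!|\!|_{q,m,\alpha}\le\varepsilon'$, then we obtain a bracket $\ell'-\varepsilon'V\le\ell\le\ell'+\varepsilon'V$ of size $\|(\ell'+\varepsilon'V)-(\ell'-\varepsilon'V)\|_2=2\varepsilon'\|V\|_2$. Therefore, if we denote by $N(\mathfrak{I}_{q,m,\alpha},|\!|\!|\cdot|\!|\!|_{q,m,\alpha},\varepsilon')$ the cardinality of the largest packing of $\mathfrak{I}_{q,m,\alpha}$ by $\varepsilon'$-separated points with respect to the $|\!|\!|\cdot|\!|\!|_{q,m,\alpha}$-norm, then

\[ N(\mathfrak{L}_{q,m,\alpha},\varepsilon) \le N(\mathfrak{I}_{q,m,\alpha},|\!|\!|\cdot|\!|\!|_{q,m,\alpha},\varepsilon/2\|V\|_2) \quad\text{for } \varepsilon>0. \]

But note that, by construction, $\mathfrak{I}_{q,m,\alpha}$ is included in a $|\!|\!|\cdot|\!|\!|_{q,m,\alpha}$-ball of radius not exceeding $(6+3T)/(\sqrt{c^*\alpha}\wedge c^*)$. Therefore, using the standard fact that the packing number of the $r$-ball $B(r)=\{x\in B:|\!|\!|x|\!|\!|\le r\}$ in any $n$-dimensional normed space $(B,|\!|\!|\cdot|\!|\!|)$ satisfies $N(B(r),|\!|\!|\cdot|\!|\!|,\varepsilon)\le\big(\frac{2r+\varepsilon}{\varepsilon}\big)^n$, we can estimate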
In particular, if $\varepsilon\le1$ and $\alpha\le c^*$, then
Finally, note that the cardinality of $\mathfrak{M}_q$ can be estimated as
where we have used that $q\ge q^*$. We therefore obtain

^(2W)< E N(^,m,a,5-2« 1 /4|| C /|| 2 ) 
m6M, 

24(2 + T)||y|| 2 /^ + ^\ 3(<i+1)9 



< 



(«5-2aV4||[/|| 2 )^ 



whenever $\delta\le1$ and $\alpha\le(\delta/2\|U\|_2)^4\wedge c^*$.
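The packing bound for a ball in an $n$-dimensional normed space, $N(B(r),\varepsilon)\le((2r+\varepsilon)/\varepsilon)^n$, which Step 1 relies on, can be illustrated in the sup-norm on $\mathbb{R}^n$, where a regular grid of spacing $\varepsilon$ is an explicit $\varepsilon$-separated set. This is only an illustration under assumptions made here: the sup-norm, the convention that "separated" means pairwise distance at least $\varepsilon$, and the function names are not taken from the paper.

```python
import itertools
import math

def grid_packing(r, eps, n):
    """Grid of spacing eps inside the sup-norm ball of radius r in R^n."""
    steps = int(math.floor(2.0 * r / eps + 1e-12))
    axis = [-r + k * eps for k in range(steps + 1)]
    return list(itertools.product(axis, repeat=n))

def sup_dist(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def packing_bound(r, eps, n):
    """Volume-comparison bound ((2 r + eps) / eps)^n."""
    return ((2.0 * r + eps) / eps) ** n

points = grid_packing(1.0, 0.5, 2)
min_separation = min(
    sup_dist(x, y) for x, y in itertools.combinations(points, 2)
)
```

For $r=1$, $\varepsilon=0.5$, $n=2$ the grid has 25 points and the bound is also 25, so in the sup-norm the bound cannot be improved in general.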

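Step 2 (the second term). For $f,f'\in\mathcal{M}_q$ with $h(f,f^*)>\alpha$ and $h(f',f^*)>\alpha$,

\[ |d_f-d_{f'}| = \frac{\big|(\sqrt{f/f^*}-1)\,\|\sqrt{f'/f^*}-1\|_2 - (\sqrt{f'/f^*}-1)\,\|\sqrt{f/f^*}-1\|_2\big|}{h(f,f^*)\,h(f',f^*)} \le \frac{\|\sqrt{f/f^*}-\sqrt{f'/f^*}\|_2\,|\sqrt{f/f^*}-1| + \sqrt2\,|\sqrt{f/f^*}-\sqrt{f'/f^*}|}{\alpha^2}, \]

where we have used that $h(f,f^*)\le\sqrt2$ for any $f$. Now note that

\[ |\sqrt a-\sqrt b|^2 \le |\sqrt a-\sqrt b|\,|\sqrt a+\sqrt b| = |a-b| \]

for any $a,b\ge0$. We can therefore estimate

\[ |d_f-d_{f'}| \le \frac{\|(f-f')/f^*\|_1^{1/2}\,(\sqrt{H_0}+1) + \sqrt2\,|(f-f')/f^*|^{1/2}}{\alpha^2}, \]

where we have used that $|\sqrt{f/f^*}-1|\le\sqrt{H_0}+1$ for any $f\in\mathcal{M}$. Now note that if we write $f=\sum_{i=1}^q\pi_if_{\theta_i}$ and $f'=\sum_{i=1}^q\pi_i'f_{\theta_i'}$, then we can estimate

\[ \Big|\frac{f-f'}{f^*}\Big| \le H_0\sum_{i=1}^q|\pi_i-\pi_i'| + H_1\sqrt d\max_{i=1,\dots,q}\|\theta_i-\theta_i'\|. \]

Defining

\[ W = (\sqrt{H_0}+1)\,\|H_0+H_1\sqrt d\|_1^{1/2} + \sqrt2\,(H_0+H_1\sqrt d)^{1/2}, \]

we obtain

\[ |d_f-d_{f'}| \le \frac{W}{\alpha^2}\,|\!|\!|(\pi,\theta)-(\pi',\theta')|\!|\!|_q^{1/2},\qquad |\!|\!|(\pi,\theta)|\!|\!|_q = \sum_{i=1}^q|\pi_i| + \max_{i=1,\dots,q}\|\theta_i\| \]

(clearly $|\!|\!|\cdot|\!|\!|_q$ defines a norm on $\mathbb{R}^{(d+1)q}$). Now note that if $|\!|\!|(\pi,\theta)-(\pi',\theta')|\!|\!|_q\le\varepsilon$, then we obtain a bracket $d_{f'}-\varepsilon^{1/2}W/\alpha^2\le d_f\le d_{f'}+\varepsilon^{1/2}W/\alpha^2$ of size $\|(d_{f'}+\varepsilon^{1/2}W/\alpha^2)-(d_{f'}-\varepsilon^{1/2}W/\alpha^2)\|_2=2\varepsilon^{1/2}\|W\|_2/\alpha^2$. Therefore

\[ N(\mathcal{D}_q\setminus\mathcal{D}_{q,\alpha},\delta) \le N(\Delta_q\times\Theta^q,|\!|\!|\cdot|\!|\!|_q,\alpha^4\delta^2/4\|W\|_2^2), \]

where we have defined the simplex $\Delta_q=\{\pi\in\mathbb{R}^q_+:\sum_{i=1}^q\pi_i=1\}$. We can now estimate the quantity on the right hand side of this expression as before, giving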

TO,^)<( 8(1+r) lT^ +(c,) 4VJ+ '" 

for $\delta\le1$ and $\alpha\le c^*$.

End of proof. Choose $\alpha=(\delta/4\|U\|_2)^4$. Collecting the various estimates above, we find that for $\delta\le1\wedge4(c^*)^{1/4}$ (as $\|U\|_2\ge\|S\|_1\ge1$ by Lemma 5.14)

XCD ^ } < ^68(2 + r)||[/|||||y||2/^ + 32||^||i^ 3{d+1)q 



} /4 18 (l + r)||^||i 6 ||^|| 2 + 4 16 ||C/||i 6 (^) 



£18 

< ^ cg (T V l) 1 / 6 (||£/|| 2 V ||F || 2 V ||^|| 2 ) ^ 18(d+lk 
where c£ = ^(c*)" 1 / 12 + 2(c*) 1/12 + 4(c*) 4 / 18 + 8. It follows that 



18(d+l)g 

/ i • i / \/ I i 1 " / / /,, I ; n / / , n ! \/ i / i i / /., i - ) \ 

X(D ff ,<5) < 



c*(T v l^ditfoHf V II^H 4 v H^ll 4 V ||F 3 || 2 ) 



for all $\delta\le\delta^*$, where $C^*$ and $\delta^*$ are constants that depend only on $c^*$, $d$, and $q^*$. This establishes the estimate given in the statement of the Theorem. The proof of the second half of the Theorem follows from Corollary 5.15 and $\|H_0\|_4\ge1$. □



5.5. Proof of Theorem 4.1. The proof of Theorem 4.1 is based on Theorem 2.3 
and the following result. 
PROPOSITION 5.18. Let $\mathcal{M}^n$ for $n\ge1$ be a family of strictly positive probability densities with respect to a reference measure $\mu$, such that $\mathcal{M}^n\subseteq\mathcal{M}^{n+1}$ for all $n$. Define $\mathcal{M}=\bigcup_n\mathcal{M}^n$, and let $f^*$ be another probability density with respect to $\mu$ such that $f^*\notin\mathrm{cl}\,\mathcal{M}$, where $\mathrm{cl}\,\mathcal{M}$ denotes the $L^1(d\mu)$-closure of $\mathcal{M}$. Let $\mathcal{H}^n=\{\sqrt{f/f^*}:f\in\mathcal{M}^n\}$, and suppose there exist $K(n)\ge1$ and $p\ge1$ so that

\[ N(\mathcal{H}^n,\delta) \le \Big(\frac{K(n)}{\delta}\Big)^p \]

for all $\delta\le1$ and $n\ge1$, where $N(\mathcal{H}^n,\delta)$ is the minimal number of brackets of $L^2(f^*d\mu)$-width $\delta$ needed to cover $\mathcal{H}^n$. Let $(X_j)_{j\in\mathbb{N}}$ be i.i.d. with distribution $f^*d\mu$. If in addition $\log K(n)=o(n)$, then we have

\[ \limsup_{n\to\infty}\ \sup_{f\in\mathcal{M}^n}\ \frac1n\sum_{j=1}^n\log\Big(\frac{f(X_j)}{f^*(X_j)}\Big) < 0 \quad\text{a.s.} \]

PROOF. As in the proof of Theorem 5.1, we have 

;({/7/*} 1/2 ))-2D(ril/). 



The following claim will be proved below: 



lim sup n-^v n (]og({f/f*}^)) = a.s. 
n ^°° feM n 



1/2 ^(iog({//n 1/2 ) 

Using the claim, the proof is easily completed: indeed, we then have 

>(/*||/)<0 a.s. 



1 " / f(X ) \ 
limsup sup — > log I „ , \ | < —2 inf D( 



where the last inequality follows from Pinsker's inequality and $f^*\notin\mathrm{cl}\,\mathcal{M}$.

It therefore remains to prove the claim. To this end we apply [25], Theorem 5.11 as in the proof of [25], Theorem 7.4 (cf. Theorem 5.1 above), which yields

\[ \mathbf{P}^*\Big[\sup_{f\in\mathcal{M}^n}\big|n^{-1/2}\nu_n(\log(\{f/f^*\}^{1/2}))\big|\ge\alpha\Big] \le C\,e^{-n\alpha^2/C} \]

for every $\alpha>0$ such that $C\sqrt p\,(1+\sqrt{\log K(n)})\le\alpha\sqrt n\le32\sqrt n$ and $n\ge1$, where $C$ is a universal constant. As $\log K(n)=o(n)$, we have

\[ \sum_{n\ge1}\mathbf{P}^*\Big[\sup_{f\in\mathcal{M}^n}\big|n^{-1/2}\nu_n(\log(\{f/f^*\}^{1/2}))\big|\ge\alpha\Big] < \infty \]

for all $0<\alpha\le32$, so the claim follows from the Borel-Cantelli lemma. □
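For a single fixed density $f\ne f^*$, the strictly negative limit in Proposition 5.18 is just the law of large numbers: the average log-likelihood ratio converges to $-D(f^*\|f)<0$. The proposition extends this uniformly over the growing classes $\mathcal{M}^n$. The simulation below only illustrates the single-density case, with the hypothetical choices $f^*=N(0,1)$ and $f=N(1,1)$, for which $\log(f/f^*)(x)=x-\tfrac12$ and the limit is $-\tfrac12$.

```python
import random

def mean_log_ratio(n, shift, seed=0):
    """Average of log f(X_j) - log f*(X_j) for X_j ~ f* = N(0,1), f = N(shift,1)."""
    rng = random.Random(seed)          # fixed seed so the run is reproducible
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        # The normalizing constants of the two Gaussian densities cancel.
        total += -0.5 * (x - shift) ** 2 + 0.5 * x ** 2
    return total / n

average = mean_log_ratio(20000, 1.0)
limit = -0.5 * 1.0 ** 2                # minus the Kullback-Leibler divergence
```

With 20000 samples the standard error of the average is about 0.007, so the empirical value sits close to the limit and is clearly negative.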
We can now complete the proof of Theorem 4.1.

Proof of Theorem 4.1. By Theorem 2.3 and easy manipulations, $\mathbf{P}^*$-a.s.

\[ \limsup_{n\to\infty}\sup_{q>q^*}\frac{1}{\mathrm{pen}(n,q)-\mathrm{pen}(n,q^*)}\Big\{\sup_{f\in\mathcal{M}_q^n}\ell_n(f)-\sup_{f\in\mathcal{M}_{q^*}^n}\ell_n(f)\Big\} \]
\[ \le \limsup_{n\to\infty}\sup_{q>q^*}\frac{\eta(q)\{\log K(2n)\vee\log\log n\}}{\mathrm{pen}(n,q)-\mathrm{pen}(n,q^*)}\ \times\ \limsup_{n\to\infty}\frac{1}{\log K(2n)\vee\log\log n}\sup_{q>q^*}\frac{1}{\eta(q)}\Big\{\sup_{f\in\mathcal{M}_q^n}\ell_n(f)-\sup_{f\in\mathcal{M}_{q^*}^n}\ell_n(f)\Big\} = 0. \]

Therefore, $\mathbf{P}^*$-a.s. eventually as $n\to\infty$

\[ \sup_{f\in\mathcal{M}_q^n}\ell_n(f)-\mathrm{pen}(n,q) < \sup_{f\in\mathcal{M}_{q^*}^n}\ell_n(f)-\mathrm{pen}(n,q^*) \]

for all $q>q^*$. It follows that $\limsup_{n\to\infty}\hat q_n\le q^*$ $\mathbf{P}^*$-a.s., that is, the penalized likelihood order estimator does not asymptotically overestimate the order.
On the other hand, we note that for every $q<q^*$

\[ \limsup_{n\to\infty}\frac1n\Big\{\sup_{f\in\mathcal{M}_q^n}\ell_n(f)-\sup_{f\in\mathcal{M}_{q^*}^n}\ell_n(f)\Big\} \le \limsup_{n\to\infty}\sup_{f\in\mathcal{M}_q^n}\frac1n\sum_{j=1}^n\log\Big(\frac{f(X_j)}{f^*(X_j)}\Big), \]

which is strictly negative $\mathbf{P}^*$-a.s. by Proposition 5.18, where we have used that $\log K(n)=o(n)$ and that $N(\mathcal{H}_q^n(2),\delta)\le N(\mathcal{H}_{q^*}^n(2),\delta)\le(2K(n)/\delta)^{\eta(q^*)}$ for all $\delta\le2$ and $n$ sufficiently large. As $\mathrm{pen}(n,q)/n\to0$ as $n\to\infty$ for $q\le q^*$

\[ \limsup_{n\to\infty}\max_{q<q^*}\frac1n\Big\{\sup_{f\in\mathcal{M}_q^n}\ell_n(f)-\mathrm{pen}(n,q)-\sup_{f\in\mathcal{M}_{q^*}^n}\ell_n(f)+\mathrm{pen}(n,q^*)\Big\} < 0 \]

$\mathbf{P}^*$-a.s. In particular, we find that $\mathbf{P}^*$-a.s. eventually as $n\to\infty$

\[ \sup_{f\in\mathcal{M}_q^n}\ell_n(f)-\mathrm{pen}(n,q) < \sup_{f\in\mathcal{M}_{q^*}^n}\ell_n(f)-\mathrm{pen}(n,q^*) \]

for all $q<q^*$. It follows that $\liminf_{n\to\infty}\hat q_n\ge q^*$ $\mathbf{P}^*$-a.s., that is, the penalized likelihood order estimator does not asymptotically underestimate the order. □
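The comparison carried out in this proof is between penalized maximized log-likelihoods at different orders. A minimal sketch of the resulting selection rule is given below, assuming the estimator is the maximizer of $\sup_{f\in\mathcal{M}_q}\ell_n(f)-\mathrm{pen}(n,q)$ over the candidate orders, with ties broken toward the smaller order (the tie rule is an assumption made here). The maximized log-likelihood values and the penalty $q\log n$ are hypothetical inputs chosen for illustration; computing the suprema themselves is outside the scope of the sketch.

```python
import math

def penalized_order(max_loglik, n, pen):
    """Return the order q maximizing max_loglik[q] - pen(n, q).

    max_loglik maps each candidate order q to the maximized log-likelihood
    over the model of order q; ties go to the smallest q.
    """
    best_q = None
    best_value = None
    for q in sorted(max_loglik):
        value = max_loglik[q] - pen(n, q)
        if best_value is None or value > best_value:
            best_q, best_value = q, value
    return best_q

def linear_penalty(n, q):
    """Illustrative penalty of the form q * omega(n) with omega(n) = log n."""
    return q * math.log(n)

# Hypothetical maximized log-likelihoods: a large gain from order 1 to 2,
# then only marginal gains from adding further components.
loglik = {1: -1520.0, 2: -1431.5, 3: -1430.2, 4: -1429.9}
chosen = penalized_order(loglik, 1000, linear_penalty)
unpenalized = penalized_order(loglik, 1000, lambda n, q: 0.0)
```

Without a penalty the largest order always wins because the models are nested; the penalty is what turns the marginal gains at orders 3 and 4 into a net loss.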



Finally, let us prove Corollary 4.3.

Proof of Corollary 4.3. It is shown in the proof of Corollary 2.6 that



r:= sup i sup ({d,g)Y + - sup ((d,g)Y + \ > 0. 



By Theorem 2.5, we have 

limsup — ^ — \ sup £ n (f) - sup £ n (f) \ > 



pen(n, g) - pen(n, g*) | /eM<7 /eM 
1 



sup < sup«d,5))i- sup {(d } g))± } P*-a.s. 



Therefore, choosing $C<\Gamma/\{\eta(q)-\eta(q^*)\}$, we find that

\[ \sup_{f\in\mathcal{M}_q}\ell_n(f)-\mathrm{pen}(n,q) > \sup_{f\in\mathcal{M}_{q^*}}\ell_n(f)-\mathrm{pen}(n,q^*) \]

infinitely often $\mathbf{P}^*$-a.s., so that $\hat q_n\ne q^*$ infinitely often $\mathbf{P}^*$-a.s. □

5.6. Proof of Proposition 4.4. The proofs of the consistency results in Propositions 4.4 and 4.5 follow almost immediately from Theorem 4.1, Corollary 3.4, and Example 3.5. The main difficulty is to establish the condition $\mathcal{D}^c_q\setminus\bar{\mathcal{D}}_{q^*}\ne\varnothing$ of Corollary 4.3, which is needed to prove the inconsistency part of Proposition 4.4. To this end, we will need the following lemma characterizing $\bar{\mathcal{D}}_{q^*}$ (here we adopt the same notations as in section 3.2).

Lemma 5.19. Suppose that Assumption A holds. Then we have

\[ \bar{\mathcal{D}}_{q^*} = \Big\{\frac{L}{\|L\|_2}: L=\sum_{i=1}^{q^*}\Big\{\eta_i\frac{f_{\theta_i^*}}{f^*}+\beta_i^*\frac{D_1f_{\theta_i^*}}{f^*}\Big\},\ \eta_i\in\mathbb{R},\ \beta_i\in\mathbb{R}^d,\ \sum_{i=1}^{q^*}\eta_i=0\Big\}. \]

Proof. Let $(f_n)_{n\ge1}\subset\mathcal{M}_{q^*}$ be such that $h(f_n,f^*)\to0$ and $d_{f_n}\to d_0\in\bar{\mathcal{D}}_{q^*}$. By Theorem 5.11, we may assume without loss of generality that $f_n=\sum_{i=1}^{q^*}\pi_i^nf_{\theta_i^n}$ with $\theta_i^n\to\theta_i^*$ and $\pi_i^n\to\pi_i^*$ for every $i=1,\dots,q^*$. Taylor expansion gives

fn-r , „ . '/ "' 



where 



~ i=l 
Proceeding as in Lemmas 5.16 and 5.17, we can estimate

<2\\S\\l{2\\S\\ 2 + l}h(f n ,n + {\\S\\ 2 + l} l|ff " l! ~ 2 



/n ll^nl 



2 \\ L n\\2 



But using Theorem 5.1 1, we find that for n sufficiently large 



n^ii 2 >wii> c *x;«- en- 



i=l 

Thus we have 

pnjk < dWHihzf ^m-etw 2 < max _ 

||L n || 2 - 2c* Eti^ll^-^-ll ~ 2c * *=W Hl *" 
We have therefore shown that $L_n/\|L_n\|_2\to d_0$ in $L^2(f^*d\mu)$. Now define

v? = ^^, R= a V » z n = ^{K-<|+|K(^-^)ll}- 

i=l 

As $\sum_{i=1}^{q^*}\{|\eta_i^n|+\|\beta_i^n\|\}=1$ for all $n$, we may extract a subsequence such that $\eta_i^n\to\eta_i$, $\beta_i^n\to\beta_i$, and $\sum_{i=1}^{q^*}\{|\eta_i|+\|\beta_i\|\}=1$. We obtain immediately



Clearly $\sum_{i=1}^{q^*}\eta_i=0$. Thus we have shown that any $d_0\in\bar{\mathcal{D}}_{q^*}$ has the desired form.

It remains to show that any function of the desired form is in fact an element of $\bar{\mathcal{D}}_{q^*}$. To this end, fix $\eta_i\in\mathbb{R}$, $\beta_i\in\mathbb{R}^d$ with $\sum_{i=1}^{q^*}\eta_i=0$, and define $f_t$ for $t>0$ as



Clearly $f_t\in\mathcal{M}_{q^*}$ for all $t$ sufficiently small, and $f_t\to f^*$ as $t\to0$. But



i=l i=l 



Therefore clearly 



Ef fe* , 0*^1/^1 
Using Lemma 5.16, we obtain

\[ \lim_{t\to0}d_{f_t} = \lim_{t\to0}\frac{(f_t-f^*)/tf^*}{\|(f_t-f^*)/tf^*\|_2} = \frac{L}{\|L\|_2}. \]

Thus any function of the desired form is in $\bar{\mathcal{D}}_{q^*}$, and the proof is complete. □

Remark 5.20. The proof of Lemma 5.19 in fact shows that $\bar{\mathcal{D}}_{q^*}=\mathcal{D}^c_{q^*}$.

We can now complete the proof of Propositions 4.4 and 4.5.

Proof of Proposition 4.4. We begin by proving consistency of the penalty $\mathrm{pen}(n,q)=q\,\omega(n)$. Note that by Corollary 3.4, the assumption of Corollary 4.2 holds with $\eta(q)=18(d+1)q+1\le19(d+1)q$. Thus consistency of $\mathrm{pen}(n,q)=q\,\omega(n)$ follows from Corollary 4.2 using $\varpi(n)=\omega(n)/19(d+1)$.

To prove inconsistency of the penalty $\mathrm{pen}(n,q)=Cq\log\log n$ with $C>0$ sufficiently small, it suffices to show that $\mathcal{D}^c_{q^*+1}\setminus\bar{\mathcal{D}}_{q^*}$ is nonempty. Indeed, if this is the case then we can apply Corollary 4.3 with $q=q^*+1$, where the requisite entropy assumption follows immediately from Theorem 4.1.

Fix a nonzero $v\in\mathbb{R}^d$ and consider the function $f_t$ defined for $t>0$ as follows:

\[ f_t = \frac{\pi_1^*}{2}\,(f_{\theta_1^*+vt}+f_{\theta_1^*-vt}) + \sum_{i=2}^{q^*}\pi_i^*f_{\theta_i^*}. \]

Clearly $f_t\in\mathcal{M}_{q^*+1}$ for all $t$ sufficiently small, $f_t\to f^*$ as $t\to0$, and

\[ \frac{f_t-f^*}{t^2} = \frac{\pi_1^*}{2}\,\frac{f_{\theta_1^*+vt}-2f_{\theta_1^*}+f_{\theta_1^*-vt}}{t^2} \xrightarrow{\,t\to0\,} \frac{\pi_1^*}{2}\,v^*D_2f_{\theta_1^*}v. \]

As in the proof of Lemma 5.19, we find that

\[ d_0 := \lim_{t\to0}d_{f_t} = \lim_{t\to0}\frac{(f_t-f^*)/t^2f^*}{\|(f_t-f^*)/t^2f^*\|_2} = \frac{v^*D_2f_{\theta_1^*}v/f^*}{\|v^*D_2f_{\theta_1^*}v/f^*\|_2}. \]

By construction, $d_0\in\mathcal{D}^c_{q^*+1}$. But by Theorem 5.11, the functions $f_{\theta_i^*}$, $D_1f_{\theta_i^*}$, and $v^*D_2f_{\theta_i^*}v$ ($i=1,\dots,q^*$) are all linearly independent. Together with Lemma 5.19, this shows that $d_0\notin\bar{\mathcal{D}}_{q^*}$. Thus $d_0\in\mathcal{D}^c_{q^*+1}\setminus\bar{\mathcal{D}}_{q^*}$, and the proof is complete. □
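The key computation in this proof is that splitting one component symmetrically along a direction $v$ produces, after dividing by $t^2$, the second derivative $v^*D_2f_{\theta}v$. In dimension $d=1$ with a Gaussian location family (an assumption made here purely for illustration; the paper's family is more general), $D_2f_\theta(x)=((x-\theta)^2-1)f_\theta(x)$, and the convergence of the symmetric second difference can be checked numerically. The function names below are hypothetical.

```python
import math

def density(x, theta):
    """Gaussian location density f_theta(x) = N(theta, 1) evaluated at x."""
    return math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2.0 * math.pi)

def second_difference(x, theta, v, t):
    """(f_{theta + v t} - 2 f_theta + f_{theta - v t})(x) / t^2."""
    return (
        density(x, theta + v * t) - 2.0 * density(x, theta) + density(x, theta - v * t)
    ) / (t * t)

def second_derivative_in_theta(x, theta):
    """D_2 f_theta(x) for the Gaussian location family."""
    return ((x - theta) ** 2 - 1.0) * density(x, theta)

theta_star, v, t = 0.3, 2.0, 1e-3
xs = [-3.0, -1.0, 0.0, 0.3, 1.5, 4.0]
largest_gap = max(
    abs(second_difference(x, theta_star, v, t) - v * v * second_derivative_in_theta(x, theta_star))
    for x in xs
)
```

The gap is of order $t^2$, in line with the claim that $(f_t-f^*)/t^2$ converges to $\tfrac{\pi_1^*}{2}v^*D_2f_{\theta_1^*}v$ as $t\to0$.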

Proof of Proposition 4.5. By Example 3.5, the assumption of Theorem 4.1 holds with $\eta(q)=18(d+1)q+1$ and $\log K(n)=\log C_1^*+C_2^*\,T(n)^2$. The desired consistency results now follow immediately from Theorem 4.1. □